Is it possible to find common words in specific Lucene documents?
For example:
    doc1 = "i got new apple iphone 8";
    doc2 = "have seen new apple iphone 8?";
    doc3 = "the apple iphone 8 out";
    doc4 = "another doc without common words";
    find_commons(["doc1", "doc2", "doc3", "doc4"]);
Results: {{"doc1", "doc2", "doc3"}, {"apple", "iphone"}}
or similar.
Another question: is there a better library or system to achieve this using Lucene's data?
Yes, you can use a TermVector to retrieve this information.
First, you need to make sure that term vectors are stored in your index, e.g.:
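    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexOptions;

    private static Document createDocument(String title, String content) {
        Document doc = new Document();
        doc.add(new StringField("title", title, Field.Store.YES));

        // Field type that tokenizes the text and stores term vectors for it.
        FieldType type = new FieldType();
        type.setTokenized(true);
        type.setStoreTermVectors(true);
        type.setStored(false);
        type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
        doc.add(new Field("content", content, type));
        return doc;
    }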
Then, you can retrieve the term vector for a given document ID:
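    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.util.BytesRef;

    private static List<String> getTermsForDoc(int docId, String field, IndexReader reader) throws IOException {
        List<String> result = new ArrayList<>();
        Terms terms = reader.getTermVector(docId, field);
        if (terms == null) {
            // No term vector was stored for this document and field.
            return result;
        }
        TermsEnum it = terms.iterator();
        for (BytesRef br = it.next(); br != null; br = it.next()) {
            result.add(br.utf8ToString());
        }
        return result;
    }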
Finally, you can retrieve the common terms of two documents:
    private static List<String> getCommonTerms(int docId1, int docId2, IndexSearcher searcher) throws IOException {
        // Using the field "content" as an example here.
        IndexReader reader = searcher.getIndexReader();
        List<String> termList1 = getTermsForDoc(docId1, "content", reader);
        List<String> termList2 = getTermsForDoc(docId2, "content", reader);
        // Keep only the terms that also occur in the second document.
        termList1.retainAll(termList2);
        return termList1;
    }
Of course, this can be expanded to allow an arbitrary number of documents.
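One possible way to do that expansion is sketched below. It is a minimal sketch that assumes the per-document term lists have already been fetched (for example, one list per document ID from the term-vector lookup) and works on plain Java collections only. The class name CommonTerms, the method commonTerms, and the hand-written sample term lists are illustrative assumptions, not part of Lucene or of the original answer.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of Lucene: intersects already-fetched term lists.
class CommonTerms {

    // Returns the terms that occur in every one of the given term lists,
    // in the order in which they appear in the first list.
    static Set<String> commonTerms(List<List<String>> termLists) {
        if (termLists.isEmpty()) {
            return new LinkedHashSet<>();
        }
        // Start from the first document's terms and narrow the set down.
        Set<String> common = new LinkedHashSet<>(termLists.get(0));
        for (List<String> terms : termLists.subList(1, termLists.size())) {
            // Copy into a HashSet so each membership check is cheap.
            common.retainAll(new HashSet<>(terms));
            if (common.isEmpty()) {
                break; // nothing left to intersect
            }
        }
        return common;
    }

    public static void main(String[] args) {
        // Hand-written stand-ins for the term lists of doc1..doc4 (assumed, not read from an index).
        List<String> doc1 = Arrays.asList("8", "apple", "got", "i", "iphone", "new");
        List<String> doc2 = Arrays.asList("8", "apple", "have", "iphone", "new", "seen");
        List<String> doc3 = Arrays.asList("8", "apple", "iphone", "out", "the");
        List<String> doc4 = Arrays.asList("another", "common", "doc", "without", "words");

        System.out.println(commonTerms(Arrays.asList(doc1, doc2, doc3)));       // prints [8, apple, iphone]
        System.out.println(commonTerms(Arrays.asList(doc1, doc2, doc3, doc4))); // prints []
    }
}
```

Note that intersecting all four sample documents yields an empty set, so getting a result like the one in the question would still require deciding which subset of documents to intersect. Depending on the analyzer, tokens such as "8" may also show up as common terms.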