Is it possible to find common words in specific Lucene documents?


For example:

    doc1 = "i got new apple iphone 8";
    doc2 = "have seen new apple iphone 8?";
    doc3 = "the apple iphone 8 out";
    doc4 = "another doc without common words";

    find_commons(["doc1", "doc2", "doc3", "doc4"]);

Result: {{"doc1", "doc2", "doc3"}, {"apple", "iphone"}} or similar.

Another question: is there a better library or system to achieve this using Lucene's data?

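Yes, you can use a term vector to retrieve this information.

First, you need to make sure that term vectors are stored in the index, e.g.:

    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.FieldType;
    import org.apache.lucene.document.StringField;
    import org.apache.lucene.index.IndexOptions;

    private static Document createDocument(String title, String content) {
        Document doc = new Document();

        doc.add(new StringField("title", title, Field.Store.YES));

        // Tokenized, unstored field that keeps term vectors for later lookup.
        FieldType type = new FieldType();
        type.setTokenized(true);
        type.setStoreTermVectors(true);
        type.setStored(false);
        type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
        doc.add(new Field("content", content, type));

        return doc;
    }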

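Then, you can retrieve the term vector for a given document ID:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Terms;
    import org.apache.lucene.index.TermsEnum;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.util.BytesRef;

    private static List<String> getTermsForDoc(int docId, String field, IndexReader reader) throws IOException {
        List<String> result = new ArrayList<>();

        Terms terms = reader.getTermVector(docId, field);
        if (terms == null) {
            // No term vector was stored for this document and field.
            return result;
        }
        TermsEnum it = terms.iterator();
        for (BytesRef br = it.next(); br != null; br = it.next()) {
            result.add(br.utf8ToString());
        }

        return result;
    }

Finally, you can retrieve the common terms of two documents:

    private static List<String> getCommonTerms(int docId1, int docId2, IndexSearcher searcher) throws IOException {
        // Using the field "content" as an example here.
        IndexReader reader = searcher.getIndexReader();
        List<String> termList1 = getTermsForDoc(docId1, "content", reader);
        List<String> termList2 = getTermsForDoc(docId2, "content", reader);

        // Keep only the terms of the first document that also occur in the second.
        termList1.retainAll(termList2);
        return termList1;
    }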

Of course, this can be expanded to allow an arbitrary number of documents.
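One way to generalise the two-document case is to intersect the term lists one after another. The following is only a minimal sketch in plain Java, with no Lucene dependency: the class name `CommonTerms` and the method `intersectAll` are hypothetical, and the term lists in `main` are hand-written stand-ins for what a term-vector lookup might return for doc1 to doc3 from the question (the real terms depend on the analyzer used at index time).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class CommonTerms {

    /**
     * Returns the terms that occur in every list, in the order in which
     * they appear in the first list. An empty input yields an empty result.
     */
    public static List<String> intersectAll(List<List<String>> termLists) {
        if (termLists.isEmpty()) {
            return new ArrayList<>();
        }
        // LinkedHashSet keeps the order of the first document's terms.
        Set<String> common = new LinkedHashSet<>(termLists.get(0));
        for (List<String> terms : termLists.subList(1, termLists.size())) {
            // A HashSet makes each containment check cheap.
            common.retainAll(new HashSet<>(terms));
        }
        return new ArrayList<>(common);
    }

    public static void main(String[] args) {
        // Assumed term lists for doc1, doc2 and doc3 (illustrative only).
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("8", "apple", "got", "i", "iphone", "new"),
            Arrays.asList("8", "apple", "have", "iphone", "new", "seen"),
            Arrays.asList("8", "apple", "iphone", "out", "the"));

        System.out.println(intersectAll(docs)); // prints [8, apple, iphone]
    }
}
```

In this sketch the token "8" also counts as a common term, so the result differs slightly from the one in the question. Adding a document with no shared words, such as doc4, would make the intersection empty, so grouping documents that actually share terms would need an extra step.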

