I have a Lucene index containing 5 million entries. I query it with "distorted" snippets of the indexed documents and take the top-1 document and its score. From these two values, I need to decide whether the returned document is correct. My first approach was to train a Random Forest on the id of the returned document and its score (that is, for each searched snippet, I add one training instance containing the returned data). Although this has been quite effective for some documents, it has performed poorly for others.

For every document, the query against the Lucene index finds the correct document for some snippets but not for others (which leaves me with 100% recall but low precision).

How can I define an effective heuristic for telling which results are correct?

Has QUIT--Anony-Mousse
Felipe Martins Melo

1 Answer

If I understand your question correctly, you want to retrieve the document whose distorted form is the current query. This is similar to the near-duplicate detection problem, which is typically solved with word-level n-grams (called shingles). The Jaccard coefficient of the two sets of shingles is an effective way of measuring this. For more details, refer to Andrei Broder's paper on near-duplicate document detection.
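
As a rough illustration of the shingle idea, here is a minimal Python sketch. The function names `shingles` and `jaccard`, the simple regex tokenizer, and the sample strings are my own assumptions for the example and are not taken from Broder's paper or from Lucene.

```python
import re

def shingles(text, n=2):
    """Return the set of word-level n-grams (shingles) found in `text`."""
    # Assumption: a plain lowercase \w+ tokenizer is good enough for the sketch.
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard coefficient |A & B| / |A | B| of two sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical sample data: an indexed text and a distorted version of it.
doc = "the quick brown fox jumps over the lazy dog"
snippet = "quick brown fox leaps over the lazy dog"

similarity = jaccard(shingles(doc), shingles(snippet))
print(similarity)  # 5 shared shingles out of 10 distinct ones -> 0.5
```

A score close to 1 means the two texts share most of their shingles. Any cut-off for accepting a result would have to be tuned on labeled examples; the sketch does not suggest a particular value.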

Debasis
  • Thanks Debasis. The query is just a distorted snippet, not the document itself. In fact, instead of indexing the original content, I removed the stop words, created 2-grams of the remaining terms, and built the index from those. The snippets are collected from a streaming source and likewise have their stop words removed and 2-grams generated. It performs quite well in terms of recall (for each streamed document, I'm able to detect it correctly for at least one snippet). My problem is precision. I need to be able to tell, with good precision, whether the top-1 document is correct for a given snippet. – Felipe Martins Melo Dec 03 '14 at 13:45
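
The preprocessing described in this comment (drop stop words, then form 2-grams of the remaining terms, applied the same way to indexed documents and to snippets) could look roughly like the sketch below. The stop-word list, the tokenizer, and the name `bigram_terms` are hypothetical; the actual analyzer used for the index is not shown in the thread.

```python
import re

# Hypothetical stop-word list, for illustration only.
STOP_WORDS = frozenset({"a", "an", "the", "of", "in", "on", "and", "is", "to"})

def bigram_terms(text, stop_words=STOP_WORDS):
    """Remove stop words, then join each pair of adjacent remaining words."""
    words = [w for w in re.findall(r"\w+", text.lower()) if w not in stop_words]
    return [f"{a} {b}" for a, b in zip(words, words[1:])]

# The same function would be applied to documents at index time
# and to streamed snippets at query time.
terms = bigram_terms("The cat is on the mat in the hall")
print(terms)  # ['cat mat', 'mat hall']
```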