I have a Lucene index containing 5 million entries. I query this index with "distorted" snippets of the indexed documents and, for each query, take the top-1 document and its score. From that data, I need to tell whether the returned document is the correct one. My first approach was to train a Random Forest on the id of the returned document and its score (that is, for each searched snippet, I add one training instance containing the returned data). Although this has been quite effective for some documents, it has performed poorly for others.
For every document, the query against the Lucene index finds the correct document for some snippets but not for others (which leaves me with 100% recall but low precision).
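To make the setup above concrete, here is a minimal sketch of the training instances and of a per-document accuracy breakdown. It is not my actual code: the names `Instance`, `predict_correct`, and `accuracy_by_document` are hypothetical, the data is made up, and a single global score cutoff stands in for the Random Forest so the example has no dependencies.

```python
from collections import defaultdict
from typing import NamedTuple


class Instance(NamedTuple):
    """One training instance: the top-1 hit for a single distorted snippet."""
    doc_id: int        # id of the document Lucene returned
    score: float       # Lucene score of that top-1 hit
    is_correct: bool   # label: was the returned document the right one?


def predict_correct(instance, cutoff):
    """Stand-in classifier (assumption): accept the hit if its score reaches the cutoff."""
    return instance.score >= cutoff


def accuracy_by_document(instances, cutoff):
    """Fraction of instances classified correctly, grouped by returned document id."""
    totals = defaultdict(int)
    right = defaultdict(int)
    for inst in instances:
        totals[inst.doc_id] += 1
        if predict_correct(inst, cutoff) == inst.is_correct:
            right[inst.doc_id] += 1
    return {doc_id: right[doc_id] / totals[doc_id] for doc_id in totals}


# Made-up data for illustration only: two documents, two snippets each.
instances = [
    Instance(doc_id=1, score=9.0, is_correct=True),
    Instance(doc_id=1, score=2.0, is_correct=False),
    Instance(doc_id=2, score=3.0, is_correct=True),
    Instance(doc_id=2, score=2.5, is_correct=False),
]

report = accuracy_by_document(instances, cutoff=5.0)
print(report)
```

In this toy data the same cutoff separates correct from incorrect hits for document 1 but not for document 2, which is the kind of uneven per-document behaviour I am seeing.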
How can I define an effective heuristic for telling which results are correct?