9

I have a Solr index with many entries. A query returns some subset of them, each entry with a score. Once the results are returned, I want to keep only those above some score (i.e. only results of a certain quality). Is it possible to do this when the returned subset could be anything?

I ask because on some queries a score of, say, 0.008 turns out to be a decent match, whereas on other queries a higher score turns out to be a poor match.

Ideally I'm just looking for a method to take the top x entries as long as they are of at least a certain quality.

Braiam
DJSunny
  • See also: http://stackoverflow.com/questions/5379176/how-to-normalize-lucene-scores http://stackoverflow.com/questions/3986220/how-do-i-normalise-a-solr-lucene-score http://stackoverflow.com/questions/2871558/remove-results-below-a-certain-score-threshold-in-solr-lucene/15765203 – kenorb Apr 02 '13 at 13:31

2 Answers

5

I think you should not do this. With the TF-IDF scoring model, there is no way to compute a threshold score above which all results are relevant and below which none are. Even if you manage to find one, it will very likely stop being valid after a few updates to your index, because document frequencies will change.

If you still want to do this, I think it is achievable using function queries: Solr provides an `if` function (in trunk) and a `query` function. Just filter your results so that you only keep entries whose score is higher than a given threshold.
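One way to express such a filter is a function range (`frange`) filter query wrapped around the `query` function. The sketch below only builds the request parameters; the helper name `build_threshold_params`, the example query and the 0.5 threshold are made up for illustration, and the exact `fq` syntax should be checked against your Solr version.

```python
from urllib.parse import urlencode


def build_threshold_params(user_query, min_score, rows=10):
    """Build Solr select parameters that filter out low-scoring documents.

    Hypothetical helper: the name and defaults are illustrative only.
    """
    return {
        "q": user_query,
        # frange keeps documents whose function value is >= l;
        # query($q) evaluates to each document's score for the main query.
        "fq": "{!frange l=%s}query($q)" % min_score,
        # Ask Solr to return the score alongside the stored fields.
        "fl": "*,score",
        "rows": rows,
    }


params = build_threshold_params("title:solr", 0.5)
# URL-encoded query string to append to a /select request.
print(urlencode(params))
```

Keep in mind that the threshold is still an absolute score, so the caveat above about it going stale as the index changes applies here too.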

jpountz
  • +1 for "... compute a score above which all results are relevant" – Jesvin Jose Nov 23 '11 at 11:35
  • Thanks! Do you have a recommended method of "sifting" the best results? Something along the lines of @Jayendra's solution of dividing by maxScore. – DJSunny Nov 23 '11 at 15:33
  • I don't have one, because there is no good way of doing this. Even by rewriting scores as percentages, you will get deceptive results. However, if you are using purely disjunctive queries, you might be interested in the 'minimum should match' parameter of the (E)DisMaxQueryParser, which lets you require that, for example, at least 75% of the clauses match for a document to be included in the results. – jpountz Nov 23 '11 at 15:58
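As a rough illustration of the 'minimum should match' suggestion in the comment above, here is a sketch of the request parameters for an edismax query. The helper name `build_mm_params` and the field names are assumptions for the example; `defType`, `qf` and `mm` are the usual (E)DisMax parameters.

```python
def build_mm_params(user_query, fields, mm="75%"):
    """Build edismax parameters with a 'minimum should match' constraint.

    Hypothetical helper: the name and field list are illustrative only.
    """
    return {
        "defType": "edismax",
        "q": user_query,
        # Fields the query terms are searched in.
        "qf": " ".join(fields),
        # Share of the optional clauses that must match.
        "mm": mm,
    }


mm_params = build_mm_params("cheap red running shoes", ["title", "description"])
print(mm_params)
```

This restricts results by how many clauses match rather than by score, which is why it holds up better when the index changes.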
3

You may also want to read through ScoresAsPercentages first.

Solr does not normalize scores, since this is easily done on the client side.
You can use the maxScore provided in the results and divide all scores by it.
The first record will then have a score of 1, and the rest will fall below it.
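A minimal client-side sketch of this, assuming a response shaped like Solr's JSON output with `score` requested in the field list. The function name `keep_top_relative`, the sample documents and the 0.5 ratio are made up for illustration.

```python
def keep_top_relative(docs, max_score, min_ratio, top_x):
    """Keep at most top_x docs whose score is >= min_ratio * max_score.

    Assumes docs are already sorted by descending score, as Solr returns
    them by default. Hypothetical helper, not part of any Solr client.
    """
    if max_score <= 0:
        return []
    kept = []
    for doc in docs:
        ratio = doc["score"] / max_score
        if ratio >= min_ratio:
            # Copy the doc and attach the normalized score.
            kept.append({**doc, "normalized_score": ratio})
    return kept[:top_x]


# Sample data shaped like the "response" section of a Solr JSON reply.
response = {
    "numFound": 3,
    "maxScore": 2.0,
    "docs": [
        {"id": "a", "score": 2.0},
        {"id": "b", "score": 1.5},
        {"id": "c", "score": 0.2},
    ],
}

best = keep_top_relative(response["docs"], response["maxScore"], 0.5, 10)
print([doc["id"] for doc in best])
```

Note that this only ranks documents relative to the best hit of the same query; it says nothing about whether that best hit is itself relevant.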

Jayendra
  • I've read the ScoresAsPercentages doc before, including its strong suggestion not to do such a thing. How well do you feel dividing by maxScore "works"? That is, does it provide a meaningful comparison of the results, or is it not great? Thanks for the answer. – DJSunny Nov 23 '11 at 15:35
  • By dividing by maxScore you should be able to filter the results and rank them. However, it still will not guarantee that the document with the maxScore is relevant. – Jayendra Nov 23 '11 at 16:31