8

I need to normalize the Lucene scores between 0 and 1.

For example, a random query returns the following scores...

8.864665
2.792687
2.792687
2.792687
2.792687
0.49009037
0.33730242 
0.33730242 
0.33730242 
0.33730242 

What is the maximum possible score? Is it 10.0?

thanks

aneuryzm

6 Answers

10

You can divide all scores by the maximum score to get scores between 0 and 1.

However, please note that the normalised scores should be used to compare the results of a single query only. It is not correct to compare the scores (normalised or not) of results from 2 different queries.
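As a minimal sketch of the division described above (the class and method names are hypothetical, and the scores are assumed to have already been read into a plain array from a single query's result list):

```java
import java.util.Arrays;

final class MaxScoreNormalizer {

    // Divides every score by the largest score in the same result list.
    // Returns a new array; the input is left unchanged.
    static float[] normalize(float[] scores) {
        float max = 0f;
        for (float s : scores) {
            max = Math.max(max, s);
        }
        float[] out = new float[scores.length];
        if (max <= 0f) {
            return out; // empty list or no positive score: all zeros
        }
        for (int i = 0; i < scores.length; i++) {
            out[i] = scores[i] / max;
        }
        return out;
    }

    public static void main(String[] args) {
        // Some of the scores from the question, highest first.
        float[] scores = {8.864665f, 2.792687f, 0.49009037f, 0.33730242f};
        System.out.println(Arrays.toString(normalize(scores)));
    }
}
```

The top hit always ends up at 1.0 regardless of how good the match really is, which is why the resulting values are only meaningful within one query.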

nikhil500
  • @nikhil500 Really? So if I have a bunch of queries, how can I see which ones are performing better? – aneuryzm Mar 21 '11 at 16:01
  • Please post some more details of how (and why) you want to compare the results of multiple queries. Scores across queries are not directly comparable, but depending on your exact problem, we may be able to come up with some solution. – nikhil500 Mar 22 '11 at 01:45
  • @nikhil500 My issue is that for each query I have to combine multiple scores (coming from other software) and they are all normalized (between 0 and 1) except for Lucene scores. – aneuryzm Mar 22 '11 at 05:55
  • Do you want to reorder the results from Lucene based on scores coming from other sources, or do you want to merge results from other sources with the Lucene results? If you want to reorder, then just go ahead and multiply the Lucene score by the external score. However, if you want to merge results from external sources with Lucene results, then it gets much more complicated: you need to somehow figure out a 'normalization factor', since it would be incorrect to assume that the top document from a Lucene result set is always scored 1 on a scale of 0 to 1. – nikhil500 Mar 22 '11 at 06:37
  • @nikhil500 The second one. And my question is indeed how to do it. Should I take the query with the highest score and use that score for the normalization? I need some help here. – aneuryzm Mar 22 '11 at 06:46
  • Please edit your question and add some more info about what exactly you are trying to do - it is not clear why and how you want to merge results from multiple queries (Your question talks about normalizing scores of a single Lucene result set). Do you need to merge Lucene results for multiple queries with external results? If you can give some details of the exact use case, then I can try to help solve the problem. – nikhil500 Mar 22 '11 at 07:05
  • @nikhil500 I don't want to merge results for multiple queries. I want to merge Lucene results with the results of other software for each query, i.e. for query1 I have Lucene: 8.1, Score2: 0.98, Score3: 0.754. – aneuryzm Mar 22 '11 at 07:14
  • The problem is actually simple, I need to assign a correct weight to Lucene scores or normalize Lucene results in order to avoid unbalanced results when I combine the scores. – aneuryzm Mar 22 '11 at 07:15
  • In that case, just multiply the 3 scores and sort the results based on that. There is no need to normalize the Lucene scores. (You can normalize if you want to, the final ordering will not change) – nikhil500 Mar 22 '11 at 08:44
  • @nikhil500 OK. One last thing: if I want to measure the difference between retrieval results using Lucene only and using the combined scores, then the normalization is necessary, right? I mean, the other scores have less influence if the Lucene scores are not normalized... – aneuryzm Mar 22 '11 at 08:54
  • No, if you multiply the 3 scores then normalization will not have any effect on the final order of the results or the relative scores of the records in the result set. – nikhil500 Mar 22 '11 at 09:27
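
To illustrate the point made in the comments above, here is a small sketch (class name, method name, and sample values are all made up for illustration) that ranks documents by the product of a Lucene score and one external score, once with the raw Lucene scores and once with them divided by the maximum:

```java
import java.util.Arrays;
import java.util.Comparator;

final class CombinedScoreOrder {

    // Returns document indices sorted by (scale * lucene[i]) * other[i], best first.
    static Integer[] rank(double[] lucene, double[] other, double scale) {
        Integer[] idx = new Integer[lucene.length];
        for (int i = 0; i < idx.length; i++) {
            idx[i] = i;
        }
        // Negate the product so that the highest combined score sorts first.
        Arrays.sort(idx, Comparator.comparingDouble(
                (Integer i) -> -(scale * lucene[i] * other[i])));
        return idx;
    }

    public static void main(String[] args) {
        double[] lucene = {8.1, 2.7, 0.5};   // hypothetical raw Lucene scores
        double[] other = {0.2, 0.9, 0.95};   // hypothetical external scores in [0, 1]

        Integer[] raw = rank(lucene, other, 1.0);
        Integer[] normalized = rank(lucene, other, 1.0 / 8.1);

        System.out.println(Arrays.toString(raw));
        System.out.println(Arrays.toString(normalized));
        System.out.println(Arrays.equals(raw, normalized));
    }
}
```

Dividing by the maximum multiplies every product by the same positive constant, so the ordering cannot change. This only holds for a product; if the scores were combined with a weighted sum instead, rescaling the Lucene scores would change their weight relative to the other scores.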
5

There is no good standard way to normalize scores with Lucene. Read this: ScoresAsPercentages and this explanation

In your case the highest score is the score of the first result, if the results are sorted by score. But this score will be different for every other query.

See also how-do-i-normalise-a-solr-lucene-score

morja
  • My issue is that I have Lucene scores plus other scores (not related to Lucene) for each query's results. The other scores are all normalized between 0 and 1. If I don't normalize the Lucene scores in the same way, I'm going to have unbalanced results... – aneuryzm Mar 21 '11 at 16:08
  • Have a look at the Collector class: http://lucene.apache.org/java/2_9_2/api/core/org/apache/lucene/search/Collector.html. You might have to write your own Collector, maybe using your other scores or a combination. – morja Mar 21 '11 at 16:21
1

There is no maximum score in Solr; it depends on too many variables, so it can't be predicted.

You can implement a so-called normalized score (Scores As Percentages), but this is not recommended.

See related links for more details:

Is it possible to set a Solr Score threshold 'reasonably', independent of results returned? (i.e. Is Solr Scoring standardized in any way)

how do I normalise a solr/lucene score?

Remove results below a certain score threshold in Solr/Lucene?

kenorb
0

A regular normalization will only help you compare the score distribution among queries (and their retrieved lists). You cannot simply normalize the scores to compare performance between queries. Think of one query for which all retrieved documents are highly relevant and receive the same high score, and another query whose retrieved list consists of barely relevant documents, again all with the same score. No matter which per-query normalization you apply, the normalized scores will be the same for both.

You need to think of a cross-query factor that can bring all the scores to the same level.

For example, you could compute the similarity between the query and the whole index, and use that score in some way along with the document score.

0

If you want to compare two or more queries, I found a workaround. You can compare your highest-scored document with your query term using the LevenshteinDistance or LuceneLevenshteinDistance (Damerau) class to get the distance between your query term and your result. The result is the similarity between them. Do this for each query you want to compare. Now you have a tool to compare your queries, using the similarity between your query term and your highest-scored result. You can then choose the query with the highest similarity and use it for whatever you do next.

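    // Damerau-Levenshtein similarity (class from org.apache.lucene.search.spell)
    LuceneLevenshteinDistance d = new LuceneLevenshteinDistance();

    // getDistance returns a float in [0, 1]; 1 means the two strings are identical
    float similarity = d.getDistance(queryterm, yourResult);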
0

I applied a non-linear function in order to compress the scores of every query.
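
The answer does not say which function was used. As one possible sketch, assuming a saturating function of the form s / (s + k), where k is a hypothetical tuning constant and the class name is made up, any non-negative score can be squeezed into the range [0, 1) without needing the maximum score of the query:

```java
final class ScoreCompressor {

    // Maps a non-negative score into [0, 1) using s / (s + k).
    // k > 0 is an assumed tuning constant: the score equal to k maps to 0.5.
    static double compress(double score, double k) {
        if (score < 0 || k <= 0) {
            throw new IllegalArgumentException("score must be >= 0 and k must be > 0");
        }
        return score / (score + k);
    }

    public static void main(String[] args) {
        // Some of the scores from the question.
        double[] scores = {8.864665, 2.792687, 0.49009037, 0.33730242};
        for (double s : scores) {
            System.out.println(s + " -> " + compress(s, 1.0));
        }
    }
}
```

The function is strictly increasing, so the ranking within a query is preserved. Unlike division by the maximum, the top result is not forced to 1.0, but the caveat from the other answers still applies: raw Lucene scores from different queries are not directly comparable, and compressing them does not change that.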