1

Greetings,

I have the following Apache Lucene snippet that's giving me some nice results:

    int numHits = 100;
    int resultsPerPage = 100;
    IndexSearcher searcher = new IndexSearcher(reader);
    TopScoreDocCollector collector = TopScoreDocCollector.create(numHits, true);
    Query q = parser.parse(queryString);
    searcher.search(q, collector);
    // First page of results (page index 0)
    ScoreDoc[] hits = collector.topDocs(0 * resultsPerPage, resultsPerPage).scoreDocs;

    Results r = new Results();
    r.length = hits.length;
    for (int i = 0; i < hits.length; i++) {
        Document doc = searcher.doc(hits[i].doc);
        double distanceKm = getGreatCircleDistance(
                lucene2double(doc.get("lat")), lucene2double(doc.get("lng")),
                Double.parseDouble(userLat), Double.parseDouble(userLng));
        // -(1/distance) * log2(score): positive only while 0 < score < 1
        double newRelevance = -((1 / distanceKm) * Math.log(hits[i].score) / Math.log(2));
        System.out.println(hits[i].doc + "\t" + hits[i].score + "\t" + doc.get("content")
                + "\t" + "Km=" + distanceKm + "\trlvnc=" + newRelevance);
    }

What I want to know is: is hits[i].score always between 0 and 1? It seems that way, but I can't be sure. I've even checked the Lucene documentation (class ScoreDoc) to no avail. You'll see I'm taking the log of hits[i].score to calculate the "newRelevance" value. I need hits[i].score to be between 0 and 1, because if it is below zero, I'll get an error; above 1 and the sign will change from negative to positive.

I hope some Lucene expert out there can offer me some insight.

Many thanks,

Eamorr
  • FWIW, [cosine similarity](http://en.wikipedia.org/wiki/Cosine_similarity) is always in [0,1] for non-negative term vectors. Lucene uses a modified form of this, which may deviate in complex ways from the theory. – Xodarap Jan 10 '11 at 15:23

3 Answers

4

Yes, the score will always be between 0 and 1.

When Lucene calculates the score, it finds individual scores for term hits within fields, etc., and totals them. If the highest ranked hit has a total greater than 1, all of the document scores are normalised to be between 0 and 1, with the highest ranked document having a score of 1. If, however, no document's total was greater than 1, no normalisation occurs and the scores are returned as-is. This is why the top document sometimes has a score of 1 and at other times has a score lower than 1.
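The normalisation described above can be sketched roughly as follows. This is only an illustration of the behaviour as described, not Lucene's actual source; the class and method names (`ScoreNormalisation`, `normalise`) are made up for the example.

```java
import java.util.Arrays;

public class ScoreNormalisation {
    // Hypothetical helper, not part of Lucene: scales every score by the top
    // score, but only when the top score exceeds 1. Otherwise the scores are
    // returned unchanged.
    static float[] normalise(float[] rawScores) {
        float max = 0f;
        for (float s : rawScores) {
            max = Math.max(max, s);
        }
        float[] out = rawScores.clone();
        if (max > 1.0f) {
            for (int i = 0; i < out.length; i++) {
                out[i] = out[i] / max;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Top score above 1: everything is scaled so the top hit becomes 1.
        System.out.println(Arrays.toString(normalise(new float[] {2.0f, 1.0f, 0.5f})));
        // Top score below 1: scores come back as-is.
        System.out.println(Arrays.toString(normalise(new float[] {0.8f, 0.4f})));
    }
}
```

With this rule the top hit is exactly 1 whenever any raw total exceeded 1, and below 1 otherwise.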


EDIT: Having done a bit more research, the answer is most likely no. In the version of Lucene I am familiar with (v2.3.2), searches pass through the Hits object, whose GetMoreDocs() method normalises scores if any of them are greater than 1. In later versions this no longer appears to be the case, as the Hits class is no longer used. Whether your scores will be between 0 and 1 will depend on which version of Lucene you are using, and which mechanism is being used to search.

To quote from the Lucene mailing list:

The score is an arbitrary number > 0. It's not normalized to anything, it should only be used to e.g. sort the results
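If the score really is an arbitrary positive number, the log-based formula from the question changes sign once the score passes 1. The sketch below shows that with plain `Math.log`, plus one possible workaround: dividing each score by the top score of the same result set. The names `LogScoreCheck`, `newRelevance` and `relativeScore` are hypothetical, and the workaround is an assumption for illustration, not something Lucene does for you here.

```java
public class LogScoreCheck {
    // Same shape as the formula in the question: -(1/distanceKm) * log2(score).
    static double newRelevance(double score, double distanceKm) {
        return -(1.0 / distanceKm) * (Math.log(score) / Math.log(2));
    }

    // Hypothetical workaround: express a score relative to the top score of
    // the same result set, which puts it in (0, 1] when all scores are > 0.
    static double relativeScore(double score, double topScore) {
        return score / topScore;
    }

    public static void main(String[] args) {
        double distanceKm = 2.0;
        System.out.println(newRelevance(0.5, distanceKm));  // positive: 0 < score < 1
        System.out.println(newRelevance(4.0, distanceKm));  // negative: score > 1
        System.out.println(newRelevance(-1.0, distanceKm)); // NaN: Math.log of a negative number
        // Scaling by the top score (4.0 here) brings the value back into (0, 1].
        System.out.println(newRelevance(relativeScore(2.0, 4.0), distanceKm)); // positive
    }
}
```

Note that `Math.log` returns NaN for a negative argument rather than throwing, and that after scaling the top hit has a relative score of 1, so its log term is zero.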

adrianbanks
  • I'm using Lucene 2.9.2. I hope it's between 0 and 1. If the relevance can go over 1, I'll have to look at using something other than logarithms. – Eamorr Jan 10 '11 at 01:18
  • Here's a hopefully better link to the same mailing thread: http://www.lucidimagination.com/search/document/ea3f9d1167fce259/range_score_in_lucene#3969a7cd3b7b7681 . Basically, you are trying to combine the distance with the score, which is a tough problem. I guess you can try custom sorting with some weights and see how it works. – Yuval F Jan 10 '11 at 06:57
1

I believe that Lucene scores are always normalised, i.e. the top-scoring hits get 1 (or near to it). The values should then always be between 0 and 1. By extension, this means that the scores have no objective meaning, i.e. they cannot be compared with anything other than other hits from the same result set.

Disclaimer: I am not a Lucene Scientist. This is based only on my observations of Lucene in action. I've never seen this actually documented, so I may have got completely the wrong end of the stick.

skaffman
  • Thanks for the reply. That's what I thought. I'd like to firm it up with something official though... This is a critical part of my app! – Eamorr Jan 09 '11 at 22:16
0

The scores are between 1 and 0, but the top score does not have to be 1. Scores are always relative to one another, and a direct comparison should not really be made between scores of two different queries.

Joel