
Lucene scoring seems to completely elude my understanding.

I have a set of documents with the following titles:

Senior Education Recruitment Consultant
Senior IT Recruitment Consultant
Senior Recruitment Consultant

These have been analysed using EnglishAnalyzer.

The search query is built with a QueryParser using EnglishAnalyzer as well.
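For reference, here is roughly what that setup looks like (a minimal sketch; the field name Title is taken from the debugging output below, Lucene 4.10.x is assumed, and searcher is an already-open IndexSearcher):

// Both indexing and querying use EnglishAnalyzer, so "Recruitment" is stemmed to "recruit", etc.
Analyzer analyzer = new EnglishAnalyzer(Version.LUCENE_4_10_2);

QueryParser parser = new QueryParser(Version.LUCENE_4_10_2, "Title", analyzer);
Query query = parser.parse("Senior Recruitment Consultant");

TopDocs results = searcher.search(query, 10);
for (ScoreDoc hit : results.scoreDocs) {
    // explain() produces the per-document score breakdown shown below
    System.out.println(searcher.explain(query, hit.doc));
}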

When I search for Senior Recruitment Consultant, every one of the above documents is returned with the same score, whereas the desired (and expected) result would be Senior Recruitment Consultant as the top result.

Is there a straightforward way of achieving the desired behaviour that I've missed?

Here is my debugging output:

4.6491017 = (MATCH) sum of:
  1.1064172 = (MATCH) weight(Title:senior in 22157) [DefaultSimilarity], result of:
    1.1064172 = score(doc=22157,freq=1.0 = termFreq=1.0
), product of:
      0.4878372 = queryWeight, product of:
        4.53601 = idf(docFreq=818, maxDocs=28116)
        0.10754765 = queryNorm
      2.268005 = fieldWeight in 22157, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        4.53601 = idf(docFreq=818, maxDocs=28116)
        0.5 = fieldNorm(doc=22157)
  2.3421772 = (MATCH) weight(Title:recruit in 22157) [DefaultSimilarity], result of:
    2.3421772 = score(doc=22157,freq=1.0 = termFreq=1.0
), product of:
      0.70978254 = queryWeight, product of:
        6.5997033 = idf(docFreq=103, maxDocs=28116)
        0.10754765 = queryNorm
      3.2998517 = fieldWeight in 22157, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        6.5997033 = idf(docFreq=103, maxDocs=28116)
        0.5 = fieldNorm(doc=22157)
  1.2005073 = (MATCH) weight(Title:consult in 22157) [DefaultSimilarity], result of:
    1.2005073 = score(doc=22157,freq=1.0 = termFreq=1.0
), product of:
      0.50815696 = queryWeight, product of:
        4.724947 = idf(docFreq=677, maxDocs=28116)
        0.10754765 = queryNorm
      2.3624735 = fieldWeight in 22157, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        4.724947 = idf(docFreq=677, maxDocs=28116)
        0.5 = fieldNorm(doc=22157)

4.6491017 = (MATCH) sum of:
  1.1064172 = (MATCH) weight(Title:senior in 22292) [DefaultSimilarity], result of:
    1.1064172 = score(doc=22292,freq=1.0 = termFreq=1.0
), product of:
      0.4878372 = queryWeight, product of:
        4.53601 = idf(docFreq=818, maxDocs=28116)
        0.10754765 = queryNorm
      2.268005 = fieldWeight in 22292, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        4.53601 = idf(docFreq=818, maxDocs=28116)
        0.5 = fieldNorm(doc=22292)
  2.3421772 = (MATCH) weight(Title:recruit in 22292) [DefaultSimilarity], result of:
    2.3421772 = score(doc=22292,freq=1.0 = termFreq=1.0
), product of:
      0.70978254 = queryWeight, product of:
        6.5997033 = idf(docFreq=103, maxDocs=28116)
        0.10754765 = queryNorm
      3.2998517 = fieldWeight in 22292, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        6.5997033 = idf(docFreq=103, maxDocs=28116)
        0.5 = fieldNorm(doc=22292)
  1.2005073 = (MATCH) weight(Title:consult in 22292) [DefaultSimilarity], result of:
    1.2005073 = score(doc=22292,freq=1.0 = termFreq=1.0
), product of:
      0.50815696 = queryWeight, product of:
        4.724947 = idf(docFreq=677, maxDocs=28116)
        0.10754765 = queryNorm
      2.3624735 = fieldWeight in 22292, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        4.724947 = idf(docFreq=677, maxDocs=28116)
        0.5 = fieldNorm(doc=22292)

4.6491017 = (MATCH) sum of:
  1.1064172 = (MATCH) weight(Title:senior in 22494) [DefaultSimilarity], result of:
    1.1064172 = score(doc=22494,freq=1.0 = termFreq=1.0
), product of:
      0.4878372 = queryWeight, product of:
        4.53601 = idf(docFreq=818, maxDocs=28116)
        0.10754765 = queryNorm
      2.268005 = fieldWeight in 22494, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        4.53601 = idf(docFreq=818, maxDocs=28116)
        0.5 = fieldNorm(doc=22494)
  2.3421772 = (MATCH) weight(Title:recruit in 22494) [DefaultSimilarity], result of:
    2.3421772 = score(doc=22494,freq=1.0 = termFreq=1.0
), product of:
      0.70978254 = queryWeight, product of:
        6.5997033 = idf(docFreq=103, maxDocs=28116)
        0.10754765 = queryNorm
      3.2998517 = fieldWeight in 22494, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        6.5997033 = idf(docFreq=103, maxDocs=28116)
        0.5 = fieldNorm(doc=22494)
  1.2005073 = (MATCH) weight(Title:consult in 22494) [DefaultSimilarity], result of:
    1.2005073 = score(doc=22494,freq=1.0 = termFreq=1.0
), product of:
      0.50815696 = queryWeight, product of:
        4.724947 = idf(docFreq=677, maxDocs=28116)
        0.10754765 = queryNorm
      2.3624735 = fieldWeight in 22494, product of:
        1.0 = tf(freq=1.0), with freq of:
          1.0 = termFreq=1.0
        4.724947 = idf(docFreq=677, maxDocs=28116)
        0.5 = fieldNorm(doc=22494)


Senior Education Recruitment Consultant 4.6491017
Senior IT Recruitment Consultant 4.6491017
Senior Recruitment Consultant 4.6491017
timsworth

2 Answers


Since every document matches each of the three query terms exactly once, tf and idf are identical across them. The only scoring element you have left to rely on is the lengthNorm.

The lengthNorm is stored with the document at index time, combined with the field's boost. It serves to score shorter documents a bit higher.

So why isn't it working? You have two problems:

First: Norms are stored with extremely lossy compression. They occupy only a single byte, with about one significant decimal digit of precision. So, basically, the difference between a three-term and a four-term title isn't big enough to survive the encoding; both norms come out as the 0.5 you see as fieldNorm above.
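You can check the collapse directly against SmallFloat, which DefaultSimilarity uses for the encoding (a quick sketch, assuming Lucene 4.x):

import org.apache.lucene.util.SmallFloat;

// lengthNorm = 1/sqrt(numTerms); DefaultSimilarity squeezes it into one byte.
float threeTerms = (float) (1.0 / Math.sqrt(3)); // 0.577..., e.g. "Senior Recruitment Consultant"
float fourTerms  = (float) (1.0 / Math.sqrt(4)); // 0.5,      e.g. "Senior Education Recruitment Consultant"

byte b3 = SmallFloat.floatToByte315(threeTerms);
byte b4 = SmallFloat.floatToByte315(fourTerms);

System.out.println(b3 == b4);                      // true: both collapse to the same byte
System.out.println(SmallFloat.byte315ToFloat(b3)); // 0.5, the fieldNorm in your explain output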

On the rationale for this lossiness, from the DefaultSimilarity documentation:

...given the difficulty (and inaccuracy) of users to express their true information need by a query, only big differences matter.

Second: "IT" is a stop word in english. You mean "Information Technology", but all the analyzer sees is the common english pronoun. And no matter how many stop words you throw into the field, they won't impact the lengthnorm.

Here's a test showing some results I came up with:

Senior Education Recruitment Consultant ::: 0.732527
Senior IT Recruitment Consultant ::: 0.732527
Senior Recruitment Consultant ::: 0.732527
if and but Senior IT IT IT IT IT Recruitment this and that Consultant ::: 0.732527
Senior Education Recruitment Consultant Of Justice ::: 0.64096117
Senior Recruitment Consultant and some other nonsense we don't want to know about ::: 0.3662635

As you can see, with "Senior Education Recruitment Consultant Of Justice" we add just one more indexed term ("Of" is itself a stop word), and the lengthNorm starts making a difference. But "if and but Senior IT IT IT IT IT Recruitment this and that Consultant" still sees no difference, because all of the added terms are common English stop words.


The solution: You could fix the norm precision issue with a custom similarity implementation that wouldn't be all that difficult to code (copy DefaultSimilarity, and implement a non-lossy encodeNormValue and decodeNormValue). You could also set up the analyzer with a custom, or empty, stop word list (via the EnglishAnalyzer ctor).
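Note that encodeNormValue and decodeNormValue are final in DefaultSimilarity, so the copy has to extend TFIDFSimilarity directly. A rough sketch of what that copy could look like (the class name is mine; Lucene 4.x assumed, where norms are stored as doc values, so a full 32-bit value fits):

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.TFIDFSimilarity;
import org.apache.lucene.util.BytesRef;

public class PreciseNormSimilarity extends TFIDFSimilarity {

  // Store the full float bit pattern instead of squeezing it into one byte.
  @Override
  public long encodeNormValue(float f) {
    return Float.floatToIntBits(f) & 0xFFFFFFFFL;
  }

  @Override
  public float decodeNormValue(long norm) {
    return Float.intBitsToFloat((int) norm);
  }

  // Everything below mirrors DefaultSimilarity's formulas.
  @Override
  public float lengthNorm(FieldInvertState state) {
    return state.getBoost() * (float) (1.0 / Math.sqrt(state.getLength()));
  }

  @Override
  public float tf(float freq) {
    return (float) Math.sqrt(freq);
  }

  @Override
  public float idf(long docFreq, long numDocs) {
    return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
  }

  @Override
  public float coord(int overlap, int maxOverlap) {
    return overlap / (float) maxOverlap;
  }

  @Override
  public float queryNorm(float sumOfSquaredWeights) {
    return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
  }

  @Override
  public float sloppyFreq(int distance) {
    return 1.0f / (distance + 1);
  }

  @Override
  public float scorePayload(int doc, int start, int end, BytesRef payload) {
    return 1.0f;
  }
}

Set it on both the IndexWriterConfig and the IndexSearcher, and reindex so the new norms get written. The stop word side is a one-liner:

Analyzer analyzer = new EnglishAnalyzer(Version.LUCENE_4_10_2, CharArraySet.EMPTY_SET);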

However, that might be throwing the baby out with the bathwater. If it's really important that precise matches be scored higher, you might be better served by expressing that with your query, like this:

\"Senior Recruitment Consultant\" Senior Recruitment Consultant

Results:

Senior Recruitment Consultant ::: 1.465054
Senior Recruitment Consultant and some other nonsense we don't want to know about ::: 0.732527
Senior Education Recruitment Consultant ::: 0.27469763
Senior IT Recruitment Consultant ::: 0.27469763
if and but Senior IT IT IT IT IT Recruitment this and that Consultant ::: 0.27469763
Senior Education Recruitment Consultant Of Justice ::: 0.24036042
femtoRgon
  • Thanks for the concise reply. Could you elaborate a little more on the query that you suggest towards the end? Is that the entire query? Is the first part to be indexed and the second the query? Cheers. – timsworth Apr 09 '15 at 16:15
  • That's the whole query, assuming the default field for your query parser can be used here. Essentially, it's combining a phrase query with three simple term queries. You can think of it as: `"Senior Recruitment Consultant" OR Senior OR Recruitment OR Consultant`. You can take a look at my test implementation if it would help: http://pastebin.com/YRMCdWaV (using Lucene 4.10.2). – femtoRgon Apr 09 '15 at 16:22

Normal Lucene ranking is frequency-based; the distance between words is not taken into account.

BUT, you can add a proximity search term, which requires the words to appear within a predefined distance of each other, to do the trick (however, you kind of need to know how many words are in your query).
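In the query syntax that is a phrase with slop, e.g. "Senior Recruitment Consultant"~10; under the hood the parser builds a PhraseQuery with setSlop. A sketch of building it programmatically (assuming Lucene 4.x and the stemmed Title terms from the other answer):

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.TermQuery;

// Phrase with slop: the terms must occur within 10 positions of each other.
PhraseQuery phrase = new PhraseQuery();
phrase.add(new Term("Title", "senior"));
phrase.add(new Term("Title", "recruit"));  // stemmed forms, since the field
phrase.add(new Term("Title", "consult"));  // was indexed with EnglishAnalyzer
phrase.setSlop(10);

// OR it with the individual terms so results whose words are
// not near each other are still returned (just scored lower).
BooleanQuery combined = new BooleanQuery();
combined.add(phrase, Occur.SHOULD);
combined.add(new TermQuery(new Term("Title", "senior")), Occur.SHOULD);
combined.add(new TermQuery(new Term("Title", "recruit")), Occur.SHOULD);
combined.add(new TermQuery(new Term("Title", "consult")), Occur.SHOULD);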

There is an answer to a similar problem on SO: Lucene.Net: Relevancy by distance between words

Zielu
  • The above example I gave is not the entire use-case and much more likely than not, exact string queries will not happen. For example, 'Senior Recruitment Consultant (Existing Clients) Manchester' could be searched in this instance, which coincidentally has the same problem. – timsworth Apr 09 '15 at 15:15
  • First try whether "Senior Recruitment Consultant (Existing Clients)"~10 works on multiple words (I believe it does); then you just have to OR it with the individual words so it will not exclude results where the words are not next to each other. You would need to tokenize your search phrase and create a new one before you pass it to Lucene. – Zielu Apr 09 '15 at 15:32
  • Writing your own query builder is actually not too difficult, but you would need to search the code of the existing QueryParser to see which Query type is used to represent the "(Existing Clients)"~10 search, as I could not easily google it. – Zielu Apr 09 '15 at 15:34
  • I tried "Senior Recruitment Consultant (Existing Clients)"~10 and in this instance the result is the same. – timsworth Apr 09 '15 at 15:41
  • By "tried" I meant check whether it works correctly on multi-word phrases. If you got search results, it does. Use "Existing Clients"~0 to get exact matches (btw the "" are important) and OR this phrase with the diffused one. – Zielu Apr 09 '15 at 15:50