3

I want the Lucene scoring function to have no bias based on the length of the document. This is a follow-up to the question "Calculate the score only based on the documents have more occurance of term in lucene".

How does Field.setOmitNorms(true) work? I see two factors that give short documents a high score:

  1. the "boost" applied to shorter posts, read with doc.getBoost()
  2. "lengthNorm" in the definition of norm(t,d)

Here is the documentation

If I want no bias towards shorter documents, is Field.setOmitNorms(true) enough?

vir
  • Look into custom Similarity implementations (derive from DefaultSimilarity and override LengthNorm, Tf, Idf and other functions used for score calculations), it may help you to understand the process further. – sisve Sep 01 '12 at 06:05
  • We had the same problem, and it worked well to combine Field.setOmitNorms(true) with setting the similarity via searcher.setSimilarity(new DefaultSimilarity() { @Override public float tf(float freq) { return 1; } }); this switched off term counting and stopped document length from being taken into account. – fricke Oct 17 '14 at 21:37

2 Answers

1

With BM25Similarity, you could reduce one of the following constructor parameters to 0f:

@param b Controls to what degree document length normalizes tf values

or

@param k1 Controls non-linear term frequency normalization (saturation).

Both parameters affect SimWeight. For example, setting b to 0f:

indexSearcher.setSimilarity(new BM25Similarity(1.2f,0f));

More explanation can be found here : http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
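
To see why b = 0f removes the length bias, here is a minimal plain-Java sketch of the textbook BM25 term-frequency component. It is not Lucene's actual implementation (which also encodes document lengths lossily); the class name Bm25LengthSketch and the method tfComponent are hypothetical names chosen for illustration.

```java
// Illustration only: textbook BM25 term-frequency component, not Lucene's code.
// Bm25LengthSketch and tfComponent are hypothetical names.
final class Bm25LengthSketch {

    /**
     * tf * (k1 + 1) / (tf + k1 * (1 - b + b * docLen / avgDocLen))
     * With b = 0 the length factor is exactly 1, so docLen has no effect.
     */
    static double tfComponent(double tf, double docLen, double avgDocLen,
                              double k1, double b) {
        double lengthFactor = 1.0 - b + b * (docLen / avgDocLen);
        return tf * (k1 + 1.0) / (tf + k1 * lengthFactor);
    }

    public static void main(String[] args) {
        double avgLen = 100.0;
        double k1 = 1.2;
        // Same term frequency (3) in a short (20 terms) and a long (500 terms) document.
        for (double b : new double[] {0.75, 0.0}) {
            System.out.println("b=" + b
                    + " short=" + tfComponent(3, 20, avgLen, k1, b)
                    + " long=" + tfComponent(3, 500, avgLen, k1, b));
        }
    }
}
```

With b = 0.75 the short document gets the larger value; with b = 0 both documents get the same value, so only term frequency (saturated by k1) and idf remain.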

0

With TF-IDF scoring, shorter documents are deliberately scored as more relevant.

You can use your own scoring functions in Lucene, and it's easy to customize the scoring algorithm: subclass DefaultSimilarity and override the method you want to change.

There's a code sample here that will help you implement it
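
As a rough illustration of what overriding the length norm changes, here is a plain-Java sketch of the classic per-term TF-IDF product (tf × idf² × norm), leaving out boosts, coord and queryNorm. It does not use the Lucene API; TfIdfLengthNormSketch, LengthNorm, CLASSIC and FLAT are hypothetical names. It assumes the classic defaults tf = sqrt(freq) and lengthNorm = 1/sqrt(numTerms), and models omitted norms or an overridden lengthNorm as a constant 1.

```java
// Illustration only: a simplified model of classic TF-IDF scoring, not Lucene's API.
// All names here are hypothetical.
final class TfIdfLengthNormSketch {

    /** Maps the number of terms in a field to a normalization factor. */
    interface LengthNorm {
        double of(int numTerms);
    }

    /** Classic default: shorter fields get a larger factor. */
    static final LengthNorm CLASSIC = numTerms -> 1.0 / Math.sqrt(numTerms);

    /** Length-neutral variant: what an override returning 1 would give. */
    static final LengthNorm FLAT = numTerms -> 1.0;

    /** Per-term score: sqrt(freq) * idf^2 * norm (boosts, coord, queryNorm omitted). */
    static double termScore(int freq, int numTerms, double idf, LengthNorm norm) {
        return Math.sqrt(freq) * idf * idf * norm.of(numTerms);
    }

    public static void main(String[] args) {
        double idf = 1.5;
        // Same frequency (2) in a 10-term field and a 1000-term field.
        System.out.println("classic short=" + termScore(2, 10, idf, CLASSIC)
                + " long=" + termScore(2, 1000, idf, CLASSIC));
        System.out.println("flat    short=" + termScore(2, 10, idf, FLAT)
                + " long=" + termScore(2, 1000, idf, FLAT));
    }
}
```

In this model the flat norm makes the score depend only on term frequency and idf, which is the behavior the question asks for.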

Rishi Dua