
Is it possible to normalize the score before the boosts get applied?

Let's say I have 2 documents:

Doc1:

title: xxxx

description: xxxx

number_published_in_this_year: 1

Doc2:

title: xxxx

description: xxxx

number_published_in_this_year: 10

Now assume I search with `q=cookie&qf=title^10 description^10&bf=number_published_in_this_year^5`.

Assume the tf-idf scores are as follows:

Doc1: title = 4, description = 5

Doc2: title = 2.5, description = 1.5

With the normal approach, the final scores would be calculated as:

Doc1: 4*10 + 5*10 + 1*5 = 95

Doc2: 2.5*10 + 1.5*10 + 10*5 = 90
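The arithmetic above can be reproduced with a minimal Python sketch. It mirrors the simplified additive model used in this question (field score times `qf` weight, plus field value times `bf` weight), not Solr's internal scoring; the dictionary layout and the `additive_score` helper are hypothetical names chosen for illustration.

```python
# Per-field text scores and the boost-function input, copied from the example.
docs = {
    "Doc1": {"title": 4.0, "description": 5.0, "number_published_in_this_year": 1},
    "Doc2": {"title": 2.5, "description": 1.5, "number_published_in_this_year": 10},
}
qf = {"title": 10, "description": 10}  # qf=title^10 description^10
bf_weight = 5                          # bf=number_published_in_this_year^5


def additive_score(doc):
    """Simplified model: weighted text scores plus the weighted boost value."""
    text = sum(doc[field] * weight for field, weight in qf.items())
    return text + doc["number_published_in_this_year"] * bf_weight


scores = {name: additive_score(doc) for name, doc in docs.items()}
print(scores)  # {'Doc1': 95.0, 'Doc2': 90.0}
```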

The idea is to normalize the scores so that text-matching scores do not dominate other factors (in this case, `number_published_in_this_year` is much larger for the second document).

Doc1: 4/5 *10 + 5/5 *10 + 1*5 = 23

Doc2: 2.5/5 *10 + 1.5/5 *10 + 10*5 = 58

(or)

Doc1: 90/90 + 1*5 = 6

Doc2: 40/90 + 10*5 = 50.4

Since Doc2 now has the higher score, it will come out on top.

Is this possible? Can someone help me with this?
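Both normalization variants can be checked with a short Python sketch. It assumes the maxima (5 for a single field score, 90 for the boosted text score) are taken over just these two documents; the key `published` and the helper names are hypothetical shorthand, and nothing here is a Solr API.

```python
docs = {
    "Doc1": {"title": 4.0, "description": 5.0, "published": 1},
    "Doc2": {"title": 2.5, "description": 1.5, "published": 10},
}
qf_weight = 10  # same weight for title and description
bf_weight = 5

# Variant 1: divide each field score by the largest field score (5 here).
max_field = max(max(d["title"], d["description"]) for d in docs.values())


def variant1(d):
    text = (d["title"] / max_field + d["description"] / max_field) * qf_weight
    return text + d["published"] * bf_weight


# Variant 2: divide the weighted text score by the largest one (90 here).
def text_score(d):
    return (d["title"] + d["description"]) * qf_weight


max_text = max(text_score(d) for d in docs.values())


def variant2(d):
    return text_score(d) / max_text + d["published"] * bf_weight


v1 = {name: round(variant1(d), 2) for name, d in docs.items()}
v2 = {name: round(variant2(d), 2) for name, d in docs.items()}
print(v1)  # {'Doc1': 23.0, 'Doc2': 58.0}
print(v2)  # {'Doc1': 6.0, 'Doc2': 50.44}
```

In both variants the text contribution is squeezed into a small range, so the unnormalized `bf` term decides the order.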

  • What would this change? What are you actually trying to achieve? "Normalizing" the score before applying the boost for all documents doesn't actually change anything except for the spacing between the documents' scores. – MatsLindh May 24 '20 at 08:51
  • (the relative distance between each document remains the same - it's just the absolute value that changes) – MatsLindh May 24 '20 at 09:09
  • Thanks Mat. I am trying to keep the final score between 0 and 1. – user2163880 May 24 '20 at 11:29
  • .. but what problem are you trying to solve by doing that? There usually isn't a reason to keep the absolute score within a specific range. What use case are you trying to solve? – MatsLindh May 24 '20 at 11:32
  • Mat, I updated my question. The idea is that text-matching scores should not dominate other factors (in this case, `number_published_in_this_year` is much larger for the second document). – user2163880 May 24 '20 at 13:21
  • The same would be achieved by using `boost` instead of `bf` - `boost` is multiplicative, so that the boost factor is multiplied into the score. In that case your `number_published_in_this_year` would directly affect the score, instead of just being a factor. Your example would then be doc2: 400, doc1: 90 – MatsLindh May 24 '20 at 15:04
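The numbers in the last comment can be reproduced with a small Python sketch. It assumes the multiplicative boost is the raw field value, i.e. something like `boost=number_published_in_this_year`, and reuses the weighted text scores from the question (90 and 40); the variable names are hypothetical.

```python
# Weighted text scores from the question: 4*10 + 5*10 and 2.5*10 + 1.5*10.
text_scores = {"Doc1": 90.0, "Doc2": 40.0}
# Field value used as the multiplicative boost (assumed: the raw value).
published = {"Doc1": 1, "Doc2": 10}

boosted = {name: text_scores[name] * published[name] for name in text_scores}
print(boosted)  # {'Doc1': 90.0, 'Doc2': 400.0}
```

Because the boost multiplies the text score rather than being added to it, the ordering no longer depends on how large the text scores are in absolute terms.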
