
I understand that the default term frequency (tf) is simply the square root of the number of times the searched term appears in a field. So documents containing multiple occurrences of the term you are searching for will have a higher tf and hence a higher weight.

What I'm unsure about is whether this increases the document's score because the weight is higher, or reduces the document's score because it moves the document vector away from the query vector, as the book Hibernate Search in Action seems to be saying (pg 363). I confess I'm really struggling to see how the document vector model fits in with the Lucene scoring equation.

Goyuix
Paul Taylor

1 Answer


I don't have this book to check, but basically (if we ignore the different boosts that can be set manually at indexing time), there are three reasons why the score of some document may be higher (or lower) than the score of other documents with Lucene's default scoring model and for a given query:

  • the queried term has a low document frequency (boosting the IDF part of the score),
  • the queried term has a high number of occurrences in the document (boosting the TF part of the score),
  • the queried term appears in a rather small field of the document (boosting the norm part of the score).

This means that for two documents D1 and D2 and one queried term T, if

  • T appears n times in D1,
  • T appears p > n times in D2,
  • the queried field of D2 has (almost) the same size (number of terms) as the queried field of D1,

D2 will have a better score than D1.
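To make the comparison concrete, here is a minimal sketch of a single-term score. It assumes the formulas documented for Lucene 3.x's default similarity (tf as the square root of the frequency, idf as 1 + ln(numDocs / (docFreq + 1)), length norm as 1 / sqrt(number of terms in the field)) and ignores boosts, coord, queryNorm and the lossy one-byte encoding of norms. The idf factor is squared because it enters once through the query weight and once through the document weight. The function names and all numbers are hypothetical, chosen only for illustration.

```python
import math

def tf(freq):
    # Term frequency factor: square root of the raw occurrence count.
    return math.sqrt(freq)

def idf(doc_freq, num_docs):
    # Inverse document frequency: rarer terms get a larger factor.
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def length_norm(num_terms):
    # Field length normalization: shorter fields get a larger factor.
    return 1.0 / math.sqrt(num_terms)

def term_score(freq, doc_freq, num_docs, num_terms):
    # Simplified single-term score (no boosts, coord or queryNorm).
    return tf(freq) * idf(doc_freq, num_docs) ** 2 * length_norm(num_terms)

# Hypothetical index statistics: T appears in 10 of 1000 documents,
# and the queried field holds 100 terms in both D1 and D2.
NUM_DOCS = 1000
DOC_FREQ = 10
FIELD_LENGTH = 100

score_d1 = term_score(2, DOC_FREQ, NUM_DOCS, FIELD_LENGTH)  # T appears n = 2 times
score_d2 = term_score(8, DOC_FREQ, NUM_DOCS, FIELD_LENGTH)  # T appears p = 8 times

print(score_d2 > score_d1)
```

With the same field length and the same idf, only the tf factor differs, so in this sketch the ratio of the two scores is sqrt(8) / sqrt(2).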

jpountz
  • Thanks, that is how I originally understood it, but I need a little bit more: how does this scoring fit into the vector space model? I don't see it. – Paul Taylor Mar 08 '12 at 08:34
  • Lucene doesn't strictly use the VSM but a combination of the VSM and of the Boolean model. However, for a disjunctive query, the VSM applies. Wikipedia has a very nice article explaining how the TF-IDF scoring applies to the VSM http://en.wikipedia.org/wiki/Vector_space_model#Example:_tf-idf_weights – jpountz Mar 08 '12 at 09:50
  • Sorry, I've read the link a couple of times but I still don't get how this fits in with the Lucene equation. I know Lucene uses a Boolean model to weed out docs that don't match any terms, but I can't see where it compares a doc's vector with a query vector; it just seems to compute tf*idf*norm for each matching term in docs that match the query and take the highest score. Also, could you expand on your point about disjunctive queries, as I am trying to implement a version of this. – Paul Taylor Mar 08 '12 at 10:04
  • This is not true: for every matching doc, it computes tf*idf*norm for each term in the query and returns the sum over those terms as the final score (this is a simplification, only true when all query terms are distinct, but in any case it is not the max). My point about disjunctive queries (terms separated by OR) was that this is when Lucene is the closest to the VSM. All you need to know about Lucene scoring is in the Similarity doc http://lucene.apache.org/core/old_versioned_docs/versions/3_5_0/api/all/org/apache/lucene/search/Similarity.html – jpountz Mar 08 '12 at 10:23
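One way to read the last comment is that, for a disjunctive query, summing tf*idf*norm over the query terms is the same as taking a dot product between a query vector and a document vector. The sketch below illustrates that reading only; it is not Lucene's implementation. It assumes query weights equal to idf and document weights equal to sqrt(freq) * idf * norm, leaves out boosts, coord and queryNorm, and uses hypothetical names and statistics throughout.

```python
import math

def idf(doc_freq, num_docs):
    # Assumed idf formula: 1 + ln(numDocs / (docFreq + 1)).
    return 1.0 + math.log(num_docs / (doc_freq + 1))

def doc_vector(term_freqs, doc_freqs, num_docs):
    # One weight per term in the field: sqrt(freq) * idf * length norm.
    norm = 1.0 / math.sqrt(sum(term_freqs.values()))
    return {t: math.sqrt(f) * idf(doc_freqs[t], num_docs) * norm
            for t, f in term_freqs.items()}

def query_vector(terms, doc_freqs, num_docs):
    # One weight per distinct query term: its idf.
    return {t: idf(doc_freqs[t], num_docs) for t in set(terms)}

def dot(query_vec, doc_vec):
    # Terms missing from the document contribute nothing (OR semantics).
    return sum(w * doc_vec[t] for t, w in query_vec.items() if t in doc_vec)

def summed_score(terms, term_freqs, doc_freqs, num_docs):
    # The "sum of tf * idf^2 * norm over matching query terms" form.
    norm = 1.0 / math.sqrt(sum(term_freqs.values()))
    return sum(math.sqrt(term_freqs[t]) * idf(doc_freqs[t], num_docs) ** 2 * norm
               for t in set(terms) if t in term_freqs)

# Hypothetical statistics: document frequency of each term in a 1000-doc index.
NUM_DOCS = 1000
DOC_FREQS = {"lucene": 10, "scoring": 5, "index": 200}

doc_a = {"lucene": 2, "index": 8}  # "lucene" twice in a 10-term field
doc_b = {"lucene": 5, "index": 5}  # "lucene" five times in a 10-term field
query = ["lucene", "scoring"]      # "scoring" matches neither document

q_vec = query_vector(query, DOC_FREQS, NUM_DOCS)
score_a = dot(q_vec, doc_vector(doc_a, DOC_FREQS, NUM_DOCS))
score_b = dot(q_vec, doc_vector(doc_b, DOC_FREQS, NUM_DOCS))

print(score_b > score_a)
```

Under these assumptions the two fields have the same length, so the extra occurrences in doc_b enlarge its component along the "lucene" axis and raise the dot product with the query.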