
I have a basic understanding of indexing (inverted indexing) and scoring (such as tf-idf) in IR. Generally, if there is no index, a tf-idf matrix is precomputed, a corresponding tf-idf vector is built for the query, and a score is then calculated for each document.

What does this flow look like when the documents are indexed? Specifically, how does a library like Apache Lucene or Terrier process a query to compute scores for documents?

95_96

1 Answer


Lucene now uses BM25 by default. Compared to the old practical tf/idf scoring formula it has a modified slope: the contribution of a term levels off as its frequency in a document grows.

When you index documents (put them into the Lucene index), each field is broken into tokens. How that happens, and what is considered a token, depends on the definition of the field. For example, if you decide to tokenize on whitespace and apply a lowercase filter, the value "Foo Bar" will be stored as two tokens, foo and bar. If you do not apply any tokenization (or use the KeywordTokenizer) and do not apply any filters, you'll get one token: Foo Bar.
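To make the two cases concrete, here is a minimal sketch in plain Python. It only imitates the idea of an analysis chain; the function names `whitespace_lowercase` and `keyword` are made up for illustration and are not Lucene APIs.

```python
def whitespace_lowercase(value: str) -> list[str]:
    """Split on whitespace, then lowercase each token."""
    return [token.lower() for token in value.split()]


def keyword(value: str) -> list[str]:
    """Keep the whole value as a single, unmodified token."""
    return [value]


print(whitespace_lowercase("Foo Bar"))  # ['foo', 'bar']
print(keyword("Foo Bar"))               # ['Foo Bar']
```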

The same process happens when you make a query. The query sent to the field is tokenized and filtered according to the rules for that field, so if you search for fOO bAR in the above example, the query consists of two tokens after processing: foo and bar.
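As a rough illustration of why the same analysis has to run on both sides, the sketch below builds a toy inverted index (token to document ids with term counts) and looks up the analyzed query tokens in it. Everything here (`analyze`, `build_index`, `candidates`, the sample documents) is hypothetical and far simpler than what Lucene actually stores.

```python
from collections import defaultdict


def analyze(text):
    """Whitespace tokenizer followed by a lowercase filter."""
    return [token.lower() for token in text.split()]


def build_index(docs):
    """Map each token to {doc_id: term frequency}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for token in analyze(text):
            index[token][doc_id] = index[token].get(doc_id, 0) + 1
    return index


def candidates(index, query):
    """Return ids of documents containing at least one query token."""
    hits = set()
    for token in analyze(query):
        hits.update(index.get(token, {}))
    return hits


docs = {1: "Foo Bar", 2: "bar baz", 3: "qux"}
index = build_index(docs)

print(analyze("fOO bAR"))                    # ['foo', 'bar']
print(sorted(candidates(index, "fOO bAR")))  # [1, 2]
```

Only the documents found through the index need to be scored, which is the main difference from scoring every row of a precomputed tf-idf matrix.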

The score for the document is then calculated according to these tokens with the BM25 formula. If you take a look at the formula, you can see that the score is calculated for each token (q), then summed up to get a score for the field.

BM25 scoring formula from Wikipedia
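For reference, the textbook form of BM25 as commonly given (for example on Wikipedia) can be written out as below. Lucene's own implementation may differ in some details, so treat this as the general shape rather than the exact code path.

```latex
\text{score}(D, Q) = \sum_{i=1}^{n} \text{IDF}(q_i) \cdot
  \frac{f(q_i, D) \cdot (k_1 + 1)}
       {f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}

\text{IDF}(q_i) = \ln\left(\frac{N - n(q_i) + 0.5}{n(q_i) + 0.5} + 1\right)
```

Here f(q_i, D) is the frequency of token q_i in document D, |D| is the document length in tokens, avgdl is the average document length in the collection, N is the number of documents, n(q_i) is the number of documents containing q_i, and k_1 and b are free parameters.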

If you add debugQuery=true after a query to Solr, you'll get detailed information about exactly how the score is being calculated.
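The kind of per-token breakdown that such debug output exposes can be imitated with a small sketch. This uses the textbook BM25 form with commonly used parameter values (k1 = 1.2, b = 0.75); it is not Solr's actual explain output, and `bm25_explain` and the toy corpus are hypothetical.

```python
import math


def bm25_explain(query_tokens, doc_tokens, corpus, k1=1.2, b=0.75):
    """Return (total score, per-token contributions) for one document."""
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    breakdown = {}
    for q in query_tokens:
        # Number of documents that contain the token at least once.
        df = sum(1 for doc in corpus if q in doc)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        tf = doc_tokens.count(q)
        # Term-frequency saturation with document-length normalization.
        denom = tf + k1 * (1 - b + b * len(doc_tokens) / avgdl)
        breakdown[q] = idf * tf * (k1 + 1) / denom
    return sum(breakdown.values()), breakdown


# Documents are already analyzed into tokens (assumed toy data).
corpus = [["foo", "bar"], ["bar", "baz"], ["qux"]]
total, parts = bm25_explain(["foo", "bar"], corpus[0], corpus)

for term, value in parts.items():
    print(f"{term}: {value:.4f}")
print(f"total: {total:.4f}")
```

In this toy corpus the rarer token foo contributes more to the total than bar, since it appears in fewer documents.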

MatsLindh