The formulas for relevance scores are documented in the MarkLogic Search Guide:
The logtfidf method (the default scoring method) uses the following formula to calculate relevance:
log(term frequency) * (inverse document frequency)
The inverse document frequency is defined as:
log(1/df)
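As a rough illustration only, the two formulas above can be combined in a few lines of Python. This is a sketch, not MarkLogic's internal implementation: the function name `logtfidf_score` is hypothetical, and it assumes that df is expressed as the fraction of documents containing the term (a value between 0 and 1), so that log(1/df) is non-negative.

```python
import math


def logtfidf_score(term_frequency: float, df: float) -> float:
    """Sketch of log(term frequency) * log(1/df).

    Assumption (not from the Search Guide): df is the fraction of
    documents that contain the term, so 0 < df <= 1.
    """
    if term_frequency <= 0:
        raise ValueError("term_frequency must be positive")
    if not 0 < df <= 1:
        raise ValueError("df must be in the range (0, 1]")
    inverse_document_frequency = math.log(1.0 / df)
    return math.log(term_frequency) * inverse_document_frequency


# A term found in 1% of documents versus one found in 50% of documents,
# with the same term frequency in the matching document.
rare = logtfidf_score(10, 0.01)
common = logtfidf_score(10, 0.5)
```

Under these assumptions, the rarer term contributes more to the score than the common one, which is the purpose of the inverse document frequency factor.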
It seems that the Knowledgebase Article shows the formula for inverse document frequency when discussing logtfidf, which might be a little confusing. The intent was to introduce and explain term frequency normalization and the options that are available to customize the score calculation beyond just the logtfidf or inverse document frequency calculation.
With the term frequency normalization setting, you can influence the relevance score. It takes into account the size of the document and the "density" of the terms relative to other documents in the database:
The scoring methods that take into account term frequency (score-logtfidf and score-logtf) will, by default, normalize the term frequency (how many search term matches there are for a document) based on the size of the document. The idea of this normalization is to take into account how frequently a term occurs in the document, relative to the other documents in the database. You can think of this as the density of terms in a document, as opposed to simply the frequency of the terms. The term frequency normalization makes a document that has, for example, 10 occurrences of the word "dog" in a 10,000,000-word document have a lower relevance than a document that has 10 occurrences of the word "dog" in a 100-word document. With the default term frequency normalization of scaled-log, the smaller document would have a higher score (and therefore be more relevant to the search), because it has a greater term density of the word "dog". For most search applications, this behavior is desirable.
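To make the "density" idea concrete, here is a small Python sketch using the numbers from the example above. It computes a plain ratio of occurrences to document length. That ratio is an illustrative assumption, not the scaled-log normalization MarkLogic actually applies, and `term_density` is a hypothetical helper name.

```python
def term_density(occurrences: int, document_words: int) -> float:
    """Occurrences of a term per word of document text.

    Illustrative only: MarkLogic's scaled-log normalization is not
    this simple ratio.
    """
    if document_words <= 0:
        raise ValueError("document_words must be positive")
    return occurrences / document_words


# 10 occurrences of "dog" in a 10,000,000-word document
large_doc_density = term_density(10, 10_000_000)

# 10 occurrences of "dog" in a 100-word document
small_doc_density = term_density(10, 100)
```

Both documents have the same raw term frequency (10), but the smaller document has a far greater density, which is why normalization ranks it as more relevant.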