
In my information retrieval course, I'm supposed to show that ranking documents by tf-idf is the same as ranking them by query likelihood, and the instructor gave us the equation for ranking documents by query likelihood. The question is very confusing: am I supposed to start from the query-likelihood equation and derive the tf-idf equation from there, or am I supposed to show that the ranking of the documents stays the same under both ranking algorithms? I really need help on this one, and I feel like I'm wasting a lot of time on a simple question. I don't want opinions on my research abilities, just clarification; and if you could, an answer would really help, because I've spent enough time on this and have 3 more assignments due in a few days.

Quentin
Ali Yahya

1 Answer


tf-idf is a rather ad hoc method: although it is intuitively clear, it is not theoretically motivated. More systematic retrieval methodologies, such as language modeling (also known as query likelihood) and BM25, put the tf-idf intuition on a theoretical footing.

For your question in particular, you should start with the query-likelihood equation and show that it is rank-equivalent to a tf-idf-style score, i.e. that both produce documents in the same order.

Query likelihood returns a list of documents ranked by P(d|q). To estimate P(d|q), use Bayes' rule: P(d|q) = P(q|d)P(d)/P(q). The denominator is constant across documents and can be ignored for ranking, and if we further assume a uniform document prior P(d), ranking by P(d|q) reduces to ranking by P(q|d). P(q|d) can then be estimated as \prod P(t|d), where t ranges over the query terms (a unigram language model that assumes term independence).

Now a query term t can either be chosen from the document d or from the collection as a whole. Let \lambda be the probability of choosing a term from the document. More specifically,

P(t|d) = \lambda tf(t,d)/len(d) + (1-\lambda) cf(t)/cs
P(q|d) = \prod P(t|d)

where tf(t,d) is the frequency of term t in document d, len(d) is the length (number of terms) of document d, cf(t) is the number of times t occurs in the collection, and cs is the total number of words in the collection. (This mixture is known as Jelinek-Mercer smoothing.)
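To make the smoothed estimate concrete, here is a minimal Python sketch. The three toy documents, the query, and the choice \lambda = 0.5 are my own illustrative assumptions, not part of the question:

```python
from collections import Counter

# Toy collection -- purely illustrative documents.
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
    "d3": "dogs and cats are pets".split(),
}

# Collection statistics: cf(t) and cs.
cf = Counter(t for words in docs.values() for t in words)
cs = sum(cf.values())

LAMBDA = 0.5  # assumed mixing weight; any value in (0, 1) works

def p_term(t, words):
    """Smoothed P(t|d): a mixture of the document model
    tf(t,d)/len(d) and the collection model cf(t)/cs."""
    tf = words.count(t)
    return LAMBDA * tf / len(words) + (1 - LAMBDA) * cf[t] / cs

def query_likelihood(query, words):
    """P(q|d) as the product of per-term probabilities."""
    p = 1.0
    for t in query:
        p *= p_term(t, words)
    return p

query = "the cat".split()
ranked = sorted(docs, key=lambda d: query_likelihood(query, docs[d]),
                reverse=True)
print(ranked)  # -> ['d2', 'd1', 'd3']
```

Note that the collection component keeps P(t|d) nonzero even when a query term is absent from a document, which is what makes the product usable for ranking.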

Since the second term of the mixture, (1-\lambda) cf(t)/cs, does not depend on the document d, you can divide P(t|d) by it without changing the ranking, and then take logs to get

log P(q|d) = \sum log (1 + \lambda/(1-\lambda) * (tf(t,d)/len(d)) * (cs/cf(t)))
           = \sum log (1 + \lambda/(1-\lambda) * tf * idf)

where tf(t,d)/len(d) plays the role of the (length-normalized) tf component and cs/cf(t) plays the role of the idf component. Since the transformation is monotone, query likelihood and this tf-idf-style score rank the documents identically.
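You can sanity-check the rank equivalence numerically. This sketch reuses the same kind of toy collection and an assumed \lambda = 0.5 (all names and data are illustrative, not from the question):

```python
import math
from collections import Counter

# Toy collection -- illustrative only.
docs = {
    "d1": "the cat sat on the mat".split(),
    "d2": "the dog chased the cat".split(),
    "d3": "dogs and cats are pets".split(),
}
cf = Counter(t for words in docs.values() for t in words)
cs = sum(cf.values())
LAMBDA = 0.5  # assumed mixing weight

def likelihood(q, words):
    # Full query likelihood P(q|d) with the smoothed term estimate.
    return math.prod(
        LAMBDA * words.count(t) / len(words) + (1 - LAMBDA) * cf[t] / cs
        for t in q
    )

def log_score(q, words):
    # Rank-equivalent form: sum over query terms of
    # log(1 + lambda/(1-lambda) * (tf/len) * (cs/cf)).
    return sum(
        math.log1p(LAMBDA / (1 - LAMBDA)
                   * (words.count(t) / len(words)) * (cs / cf[t]))
        for t in q
    )

q = "the cat".split()
by_lik = sorted(docs, key=lambda d: likelihood(q, docs[d]), reverse=True)
by_log = sorted(docs, key=lambda d: log_score(q, docs[d]), reverse=True)
print(by_lik == by_log)  # -> True: both scores rank the documents identically
```

The two scores differ by a document-independent factor (the dropped (1-\lambda) cf(t)/cs terms) and a monotone log, so the orderings always agree.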
Debasis