
This article describes the LambdaRank algorithm for information retrieval. In formula (8) on page 6, the authors propose to multiply the gradient (lambda) by a term called |∆NDCG|. I understand that this term is the difference between the two NDCGs obtained by swapping two elements in the list:

the size of the change in NDCG (|∆NDCG|) given by swapping the rank positions of U1 and U2 (while leaving the rank positions of all other urls unchanged)

However, I do not understand which ordered list is considered when swapping U1 and U2. Is it the list ordered by the predictions from the model at the current iteration? Or the list ordered by the ground-truth labels of the documents? Or perhaps the list ordered by the predictions from the model at the previous iteration, as suggested by Tie-Yan Liu in his book Learning to Rank for Information Retrieval?

dallonsi

1 Answer


Short answer: It's the list ordered by the predictions from the model at the current iteration.

Let's see why it makes sense.


At each training iteration, we perform the following steps (these steps are standard for gradient-based machine learning, whether the task is classification, regression, or ranking; a minimal sketch in code follows the list):

  1. Calculate scores s[i] = f(x[i]) returned by our model for each document i.
  2. Calculate the gradient of the cost with respect to the model's weights, ∂C/∂w, back-propagated from RankNet's cost C. This gradient is the sum of the pairwise gradients ∂C[i, j]/∂w, computed for each pair of documents (i, j).
  3. Perform a gradient descent step on the cost (i.e. w := w − u · ∂C/∂w, where u is the step size).
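
As a minimal sketch of these three steps (my own illustration, not code from the paper: the linear model, the toy data, and the values of sigma and u are all assumptions), using RankNet's pairwise logistic cost:

```python
import numpy as np

# Toy setup (assumed, for illustration): 5 documents with 3 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))        # feature vectors x[i]
rel = np.array([3, 2, 0, 1, 0])    # ground-truth relevance grades
w = np.zeros(3)                    # weights of a linear model f(x) = w @ x
sigma, u = 1.0, 0.1                # logistic scale and step size (assumed)

# Step 1: scores from the model at the current iteration.
s = X @ w

# Step 2: gradient of RankNet's cost C, summed over all pairs (i, j)
# in which document i is more relevant than document j.
grad = np.zeros_like(w)
for i in range(len(s)):
    for j in range(len(s)):
        if rel[i] > rel[j]:
            # lambda[i, j] = dC[i, j]/ds[i] = -sigma / (1 + exp(sigma * (s[i] - s[j])))
            lam_ij = -sigma / (1.0 + np.exp(sigma * (s[i] - s[j])))
            # Chain rule: ds[i]/dw = x[i] and ds[j]/dw = x[j].
            grad += lam_ij * (X[i] - X[j])

# Step 3: gradient descent step on the cost.
w -= u * grad
```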

In "Speeding up RankNet" paragraph, the notion λ[i] was introduced as contributions of each document's computed scores (using the model weights at current iteration) to the overall gradient ∂C/∂w (at current iteration). If we order our list of documents by the scores from the model at current iteration, each λ[i] can be thought of as "arrows" attached to each document, the sign of which tells us to which direction, up or down, that document should be moved to increase NDCG. Again, NCDG is computed from the order, predicted by our model.

Now, the problem is that the lambdas λ[i, j] for each pair (i, j) contribute equally to the overall gradient. That means the ranking of documents below, say, the 100th position is given the same importance as the ranking of the top documents. This is not what we want: having relevant documents at the very top matters much more than getting the order right below the 100th position.

That's why we multiply each of those "arrows" by |∆NDCG|: to emphasize correct ordering at the top of the list more than at the bottom.
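
To make this concrete, here is a hypothetical continuation of the sketch above. The key point, answering the original question, is that the ranks used for the swap come from sorting by the current model scores s, not from the ground-truth labels:

```python
def dcg(rels_in_rank_order):
    # DCG with (2^rel - 1) gains and log2(rank + 1) discounts.
    gains = 2.0 ** rels_in_rank_order - 1.0
    discounts = np.log2(np.arange(2, len(rels_in_rank_order) + 2))
    return float(np.sum(gains / discounts))

# The list in which U1 and U2 are swapped: documents ordered by the
# predictions of the model at the CURRENT iteration.
order = np.argsort(-s)                # document indices, best score first
rank_of = np.empty_like(order)
rank_of[order] = np.arange(len(s))    # rank_of[i] = current rank of doc i

max_dcg = dcg(np.sort(rel)[::-1])     # ideal DCG, used for normalization

def delta_ndcg(i, j):
    # |NDCG change| from swapping documents i and j in the predicted
    # order, leaving the positions of all other documents unchanged.
    swapped = order.copy()
    ri, rj = rank_of[i], rank_of[j]
    swapped[ri], swapped[rj] = swapped[rj], swapped[ri]
    return abs(dcg(rel[swapped]) - dcg(rel[order])) / max_dcg

# LambdaRank then scales each pairwise lambda by this factor, i.e.
# lam_ij * delta_ndcg(i, j) inside the double loop from the sketch above.
```

Since only the two swapped positions change, the difference also reduces to the closed form |2^rel[i] − 2^rel[j]| · |1/log2(r_i + 1) − 1/log2(r_j + 1)| / maxDCG (with 1-based ranks r), which is how it is typically computed in practice.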

Chan Kha Vu