This is a very fundamental and silly doubt. I have read that, to avoid having to judge an impractically large number of documents in TREC competitions (reference), the top-ranked documents returned by the participating systems are pooled to form the set of documents for relevance assessment. However, my doubt is this:
Suppose the majority of the systems use a common model, or similar models with roughly the same parameters (for example, several systems using LSA with the rank reduced to 100, 120, 150, 105, etc.). Then there are two problems. First, merging such results might not actually surface the documents relevant to each query, because the returned rankings may overlap heavily, so the pool adds little beyond any single system's output. Second, the documents selected for assessment are biased toward the models used by the participating systems, so the relevance judgements will not really be method-agnostic.
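To make the first concern concrete, here is a minimal sketch of depth-k pooling (the union of each system's top-k results). The document IDs and system rankings are invented purely for illustration; the point is that when the systems' rankings overlap heavily, the pool barely grows beyond one system's result list.

```python
def pool(rankings, depth):
    """Union of the top-`depth` documents from each system's ranking."""
    pooled = set()
    for ranking in rankings:
        pooled.update(ranking[:depth])
    return pooled

# Diverse systems: little overlap, so the pool is large.
diverse = [
    ["d1", "d2", "d3"],
    ["d4", "d5", "d6"],
    ["d7", "d8", "d9"],
]

# Similar systems (e.g. LSA at slightly different ranks): heavy
# overlap, so pooling adds only one new document to judge.
similar = [
    ["d1", "d2", "d3"],
    ["d1", "d3", "d4"],
    ["d2", "d1", "d3"],
]

print(len(pool(diverse, 3)))  # 9 distinct documents in the pool
print(len(pool(similar, 3)))  # only 4
```

With diverse systems the assessors see 9 distinct documents; with near-duplicate systems they see only 4, which is exactly the coverage worry raised above.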
I know I am missing something here, and if anyone could guide me toward the missing link, it would be really helpful!