This is a very fundamental and silly doubt. I have read that, to avoid having to judge an impractically large number of documents in TREC competitions (reference), the top-ranked documents returned by the participating systems are pooled to form the set of documents for relevance assessment. However, my doubt is this:
Suppose the majority of the systems use a common model, or similar models with roughly the same parameters (for example, several systems using LSA with the rank reduced to 100, 120, 150, 105, etc.). Then there are two problems. First, merging such results might not actually surface the documents relevant to each query, because the returned rankings may overlap heavily, so the pool adds little beyond any single system's output. Second, the documents selected for assessment are biased toward the models used by the participating systems, so the relevance judgements will not really be method-agnostic.
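To make the first concern concrete, here is a minimal sketch of depth-k pooling (the union of each system's top-k results). The document IDs and system rankings are invented purely for illustration; the point is that when the systems' rankings overlap heavily, the pool barely grows beyond one system's result list.

```python
def pool(rankings, depth):
    """Union of the top-`depth` documents from each system's ranking."""
    pooled = set()
    for ranking in rankings:
        pooled.update(ranking[:depth])
    return pooled

# Diverse systems: little overlap, so the pool is large.
diverse = [
    ["d1", "d2", "d3"],
    ["d4", "d5", "d6"],
    ["d7", "d8", "d9"],
]

# Similar systems (e.g. LSA at slightly different ranks): heavy
# overlap, so pooling adds only one new document to judge.
similar = [
    ["d1", "d2", "d3"],
    ["d1", "d3", "d4"],
    ["d2", "d1", "d3"],
]

print(len(pool(diverse, 3)))  # 9 distinct documents in the pool
print(len(pool(similar, 3)))  # only 4
```

With diverse systems the assessors see 9 distinct documents; with near-duplicate systems they see only 4, which is exactly the coverage worry raised above.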
I know I am missing something here, and if anyone could guide me toward the missing link, it would be really helpful!