
What is the fastest way to count all results for a given Query in Lucene?

  1. TopDocs.totalHits
  2. implement and manage a Filter, using QueryFilter
  3. implement a custom 'counting' Collector. This simply increments a count in the collect(int doc) method and returns true from acceptsDocsOutOfOrder(); all other methods are no-ops (a rough sketch is below).

Since 1. will do scoring on all docs, and 2. could have an upfront hit due to loading of the FieldCache, I assume the answer is 3. It just seems odd that Lucene doesn't provide such a collector out of the box?
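
For reference, here is roughly what I have in mind for 3.: a minimal counting collector written against the Lucene 3.x Collector API (method signatures changed in later releases), so treat it as a sketch rather than production code:

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    // Collector that only counts hits; it never touches scores or stored fields.
    public class CountingCollector extends Collector {
        private int count = 0;

        @Override
        public void setScorer(Scorer scorer) throws IOException {
            // Scores are never read, so nothing to do here.
        }

        @Override
        public void collect(int doc) throws IOException {
            count++;
        }

        @Override
        public void setNextReader(IndexReader reader, int docBase) throws IOException {
            // No per-segment state is needed for a plain count.
        }

        @Override
        public boolean acceptsDocsOutOfOrder() {
            return true; // order is irrelevant when only counting
        }

        public int getCount() {
            return count;
        }
    }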

npellow

2 Answers


The code should be here now: http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/TotalHitCountCollector.java
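
Usage is roughly the following (a sketch; searcher is assumed to be an open IndexSearcher and query whatever you want counted):

    TotalHitCountCollector collector = new TotalHitCountCollector();
    searcher.search(query, collector);         // no sorting, no top-N kept
    int totalHits = collector.getTotalHits();  // number of matching documents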

sleepsort
Robert Muir

You are right that #3 will be quicker, but I don't think it's because of scoring. There is a much faster way, skip to the bottom if you don't care about the reasoning behind this.

The performance loss of #1 comes from the fact that the TopDocs collector will keep the docs in a priority queue, which means that you will lose some time sorting them by score. (You will also eat up some memory, but since you're storing just a heap of int+float pairs, it's probably pretty minimal.)

As to why Lucene doesn't provide this out of the box: you generally don't want to find all results. That's why, when you search, you ask for only the top n results. There are strong theoretical reasons for this. Even Google says "Showing 25 of about n results."

So my advice to you is the following: if you have a reasonable number of results, then using TopDocs.totalHits won't be too bad performance-wise. If the totalHits field gives you problems, I don't think that a custom collector will be much better. (TopDocs.totalHits will run in roughly n log n time, and the custom collector will be linear. Depending on your setup, the log n factor may or may not matter.)
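
For example, the plain approach looks roughly like this (a sketch; searcher, query and the page size of 10 are placeholders):

    TopDocs topDocs = searcher.search(query, 10); // keep only the top 10 by score
    int total = topDocs.totalHits;                // but every match is still counted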

So, if you absolutely need this functionality, and TopDocs.totalHits is too slow, I would recommend looking at the document frequency of the search terms. You could assume the terms occur independently (so p(A and B) = p(A)*p(B)) and make a pretty good guess from there. It will be very fast, because it's just a constant-time lookup per term.
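
A sketch of that estimate for a two-term AND query (the "body" field and the terms are made up; reader is an open IndexReader and Term is org.apache.lucene.index.Term):

    int numDocs = reader.numDocs();
    double pA = (double) reader.docFreq(new Term("body", "lucene")) / numDocs;
    double pB = (double) reader.docFreq(new Term("body", "search")) / numDocs;
    long estimatedHits = Math.round(pA * pB * numDocs); // assumes the terms occur independently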

Xodarap
  • Thanks for the answer. We will go with a TotalHitCountCollector at this stage. Our data set is still small enough to accurately count. I will keep the term frequency approach you describe in mind though - that does indeed sound like the fastest approach. – npellow Feb 07 '11 at 21:48
  • I wonder how Google is doing this. Clearly it isn't really returning the "top 25" results. If it were, then it should know the total number of results as a side-effect of checking all the other results to discover that they were not in the top 25. My theory would be that it is returning 25 essentially arbitrary "worthy of being up the top" results. – Hakanai Mar 15 '11 at 01:21