
What is the fastest way to count all results for a given Query in Lucene?

  1. TopDocs.totalHits
  2. implement and manage a Filter, using QueryFilter
  3. implement a custom 'counting' Collector. This simply increments a count in the collect(int doc) method and returns true from acceptsDocsOutOfOrder(); all other methods are no-ops (a rough sketch is below).

Since 1. will do scoring on all docs, and 2. could have an upfront hit due to loading of the FieldCache, I assume the answer is 3. It just seems odd that Lucene doesn't provide such a collector out of the box?
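
For reference, here is roughly what I have in mind for 3.: a minimal counting collector written against the Lucene 3.x Collector API (method signatures changed in later releases), so treat it as a sketch rather than production code:

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.search.Collector;
    import org.apache.lucene.search.Scorer;

    // Collector that only counts hits; it never touches scores or stored fields.
    public class CountingCollector extends Collector {
        private int count = 0;

        @Override
        public void setScorer(Scorer scorer) throws IOException {
            // Scores are never read, so nothing to do here.
        }

        @Override
        public void collect(int doc) throws IOException {
            count++;
        }

        @Override
        public void setNextReader(IndexReader reader, int docBase) throws IOException {
            // No per-segment state is needed for a plain count.
        }

        @Override
        public boolean acceptsDocsOutOfOrder() {
            return true; // order is irrelevant when only counting
        }

        public int getCount() {
            return count;
        }
    }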

npellow

2 Answers


The code should be here now: http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/core/src/java/org/apache/lucene/search/TotalHitCountCollector.java
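
Usage is roughly the following (a sketch; searcher is assumed to be an open IndexSearcher and query whatever you want counted):

    TotalHitCountCollector collector = new TotalHitCountCollector();
    searcher.search(query, collector);         // no sorting, no top-N kept
    int totalHits = collector.getTotalHits();  // number of matching documents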

sleepsort
Robert Muir

You are right that #3 will be quicker, but I don't think it's because of scoring. There is a much faster way, skip to the bottom if you don't care about the reasoning behind this.

The performance loss of #1 comes from the fact that the TopDocs collector will keep the docs in a priority queue, which means that you will lose some time sorting them by score. (You will also eat up some memory, but since you're storing just a heap of int+float pairs, it's probably pretty minimal.)

As to why Lucene doesn't provide this out of the box: you generally don't want to find all results. That's why, when you search, you ask for only the top n results. There are strong theoretical reasons for this. Even Google says "Showing 25 of about n results."

So my advice to you is the following: if you have a reasonable number of results, then using TopDocs.totalHits won't be too bad performance-wise. If the totalHits field gives you problems, I don't think that a custom collector will be much better. (TopDocs.totalHits will run in roughly n log n time, and the custom collector will be linear. Depending on your setup, the log n factor may or may not matter.)
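
For example, the plain approach looks roughly like this (a sketch; searcher, query and the page size of 10 are placeholders):

    TopDocs topDocs = searcher.search(query, 10); // keep only the top 10 by score
    int total = topDocs.totalHits;                // but every match is still counted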

So, if you absolutely need this functionality, and TopDocs.totalHits is too slow, I would recommend looking at the document frequency of the search terms. You could assume the terms occur independently (so p(A and B) = p(A)*p(B)) and make a pretty good guess from there. It will be very fast, because it's just a constant-time lookup per term.
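
A sketch of that estimate for a two-term AND query (the "body" field and the terms are made up; reader is an open IndexReader and Term is org.apache.lucene.index.Term):

    int numDocs = reader.numDocs();
    double pA = (double) reader.docFreq(new Term("body", "lucene")) / numDocs;
    double pB = (double) reader.docFreq(new Term("body", "search")) / numDocs;
    long estimatedHits = Math.round(pA * pB * numDocs); // assumes the terms occur independently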

Xodarap
  • Thanks for the answer. We will go with a TotalHitCountCollector at this stage. Our data set is still small enough to accurately count. I will keep the term frequency approach you describe in mind though - that does indeed sound like the fastest approach. – npellow Feb 07 '11 at 21:48
  • I wonder how Google is doing this. Clearly it isn't really returning the "top 25" results. If it were, then it should know the total number of results as a side-effect of checking all the other results to discover that they were not in the top 25. My theory would be that it is returning 25 essentially arbitrary "worthy of being up the top" results. – Hakanai Mar 15 '11 at 01:21