
I am implementing faceted search with Lucene. I have a document index and a taxonomy index, and I collect facets for a given level of the taxonomy.

My question is: how can I get the number of documents indexed under a given category of the taxonomy?

I think this is a simple question, but I couldn't find a method for it in Lucene's API or by searching Google. I only found how to get the number of documents in the whole index, using the numDocs() method of the IndexReader class.

synack

2 Answers


If you have one term for each category in the index, perhaps you can use something like TermEnum.docFreq()? You can get the TermEnum object from IndexReader.terms(Term).
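
To illustrate why a per-category term's document frequency answers the question, here is a minimal plain-Java model of a postings map. It is a sketch, not Lucene's API: the class `CategoryPostings` and its methods are hypothetical, and it assumes each document carries its full category path as a single un-analyzed term.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy postings map (hypothetical, not Lucene): category term -> ids of documents carrying it. */
class CategoryPostings {
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    /** Indexes a document under its full category path, kept whole as one term. */
    public void add(int docId, String categoryPath) {
        postings.computeIfAbsent(categoryPath, k -> new HashSet<>()).add(docId);
    }

    /** Document frequency: number of documents carrying exactly this category term. */
    public int docFreq(String categoryPath) {
        Set<Integer> docs = postings.get(categoryPath);
        return docs == null ? 0 : docs.size();
    }

    public static void main(String[] args) {
        CategoryPostings index = new CategoryPostings();
        index.add(1, "Arts/Television");
        index.add(2, "Arts/Television");
        index.add(3, "Arts/Music");
        System.out.println(index.docFreq("Arts/Television")); // prints 2
    }
}
```

In this model the count covers only documents filed under exactly that path. Documents in subcategories carry a different term and are not included.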

Kai Chan
  • No, it is a 1-to-n relation between a category and documents. I have n documents indexed under a given category, not terms. – synack Oct 17 '12 at 21:19
  • @Kits89 You can make up a term for each of your categories, so that there is a 1-1 mapping between the categories and the terms you make up. By term, I mean Lucene's Term, along the lines of `new Term("category", "Business/Investing/Funds/Hedge_Funds")`. Your documents have a category field, right? If you have Lucene index that field without analyzing it, that takes care of the indexing part. Then, in the searching part, you can create the Term object I just mentioned and call the methods I mentioned earlier with it. – Kai Chan Oct 17 '12 at 22:51
  • Now I see what you mean. Indeed, I index the documents with a `Category` field like you say. I'll try to do what you say, thanks. – synack Oct 18 '12 at 11:44

I don't really know enough about your index structure to suggest the correct query for you, but if you execute a query searching for all the documents in your category, then the returned set of results will generally have a count of the total number of hits for the query.

For instance, if you query using either of:

search(Query query, int n)
search(Query query, Filter filter, int n) 

Then you will get a TopDocs object back, and TopDocs.totalHits gives you the total number of hits for the query.

femtoRgon
  • The taxonomy has the [ODP](http://www.dmoz.org) directory structure. The documents are the web pages classified in the ODP, and I index them using their path in the ODP directory structure. I thought that searching for all the documents in the category could be a solution, but how can I do it? In your answer, I don't see why `totalHits` would return the number of documents in the category... – synack Oct 17 '12 at 21:17
  • If you enter a query that matches precisely the set of documents in that category, then totalHits will be the number you are looking for. A prefix query such as 'directory:arts/television*' might get you what you're looking for, or you might use a phrase query, or you could search for the individual path components combined with the + operator, which might make more sense unless it leads to collisions. It depends somewhat on the representation of the data (i.e., the analyzer used, etc.). – femtoRgon Oct 18 '12 at 16:42
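
The prefix-query idea from the last comment can be sketched in plain Java, without Lucene. The class `PrefixHitCount` and the sample paths are hypothetical, and the sketch assumes each document stores its category path as one un-analyzed string. The returned count plays the role that totalHits would play for a matching prefix query.

```java
import java.util.Arrays;
import java.util.List;

/** Toy model (hypothetical, not Lucene) of counting hits for a prefix match on category paths. */
class PrefixHitCount {
    /** Counts documents whose category path starts with the given prefix. */
    static long totalHits(List<String> docCategoryPaths, String prefix) {
        return docCategoryPaths.stream().filter(p -> p.startsWith(prefix)).count();
    }

    public static void main(String[] args) {
        // One entry per document: the category path it was indexed under.
        List<String> paths = Arrays.asList(
            "arts/television",
            "arts/television/news",
            "arts/television_guides",
            "arts/music");

        // A bare prefix also matches the sibling category "arts/television_guides".
        System.out.println(totalHits(paths, "arts/television"));  // prints 3
        // A trailing separator restricts the match to descendants only.
        System.out.println(totalHits(paths, "arts/television/")); // prints 1
    }
}
```

The first call shows the kind of collision mentioned above: a bare prefix counts any path that merely begins with the same characters, so the choice of prefix and path separator matters.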