I have an index whose documents have two fields (actually more like 800 fields but the other fields won't concern us here):
- The contents field contains the analyzed/tokenized text of the document. The query string is searched for in this field.
- The category field contains a single category identifier of the document. There are about 2500 different categories, and a document may occur in several of them (i.e. a document may have multiple category entries). The results are filtered by this field.
The index contains about 20 million documents and is 5 GB in size.
The index is queried with a user-provided query string, plus an optional set of a few categories the user is not interested in. The question is: how can I exclude the documents that match the query string but also belong to one of the unwanted categories?
I could use a BooleanQuery with a MUST_NOT clause, i.e. something like this:
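    BooleanQuery q = new BooleanQuery();
    // The user's query string must match.
    q.add(contentQuery, BooleanClause.Occur.MUST);
    // Every unwanted category excludes the documents that carry it.
    for (String unwanted : unwantedCategories) {
        q.add(new TermsQuery(new Term("category", unwanted)), BooleanClause.Occur.MUST_NOT);
    }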
Is there a way to do this with Lucene filters? Performance is an issue here, and there will only be a few recurring variants of unwantedCategories, so a CachingWrapperFilter would probably help a lot. Also, due to the way the Lucene queries are generated in the existing code base, it is difficult to fit this in, whereas an extra Filter could be introduced easily.
In other words, how do I create a Filter based on what terms must _not_ occur in a document?
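To make the intended semantics concrete, here is a plain-Java sketch of the behaviour I am after. It uses only java.util.BitSet and no Lucene classes; the class ExcludeCategoriesSketch and its methods are made up for illustration, and the per-category bit sets stand in for whatever the index would provide. The idea is that the set of allowed documents is computed once per combination of unwanted categories, cached, and then intersected with the hits of each content query.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical stand-in for a cached "must not occur" filter; not Lucene API.
public class ExcludeCategoriesSketch {
    private final int maxDoc;                          // number of documents in the index
    private final Map<String, BitSet> docsByCategory;  // category id -> documents carrying it
    private final Map<Set<String>, BitSet> cache = new HashMap<>();

    public ExcludeCategoriesSketch(int maxDoc, Map<String, BitSet> docsByCategory) {
        this.maxDoc = maxDoc;
        this.docsByCategory = docsByCategory;
    }

    // Documents that carry none of the unwanted categories.
    // The result is cached per category set; callers must not modify it.
    public BitSet allowedDocs(Set<String> unwantedCategories) {
        Set<String> key = new TreeSet<>(unwantedCategories); // private copy as cache key
        return cache.computeIfAbsent(key, k -> {
            BitSet allowed = new BitSet(maxDoc);
            allowed.set(0, maxDoc);                    // start with every document allowed
            for (String category : k) {
                BitSet docs = docsByCategory.get(category);
                if (docs != null) {
                    allowed.andNot(docs);              // clear documents in this category
                }
            }
            return allowed;
        });
    }

    // Content hits restricted to documents outside the unwanted categories.
    public BitSet search(BitSet contentMatches, Set<String> unwantedCategories) {
        BitSet result = (BitSet) contentMatches.clone();
        result.and(allowedDocs(unwantedCategories));
        return result;
    }
}
```

With only a few recurring category combinations, the cache would hold just a handful of bit sets, which is the effect I am hoping to get from a cached Filter.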