Preventing certain docs from being indexed in CLucene

Question

I am building a search index with CLucene, and I want to make sure that docs containing any offensive terms never get added to the index. Using a StandardAnalyzer with a stop list is not good enough, since the offensive doc still gets added and would be returned for non-offensive searches.

Instead, I am hoping to build up a document, check whether it contains any offensive words, and add it only if it doesn't.

Cheers!

score 0 · Accepted Answer · answered Oct 16 '13 at 21:38

You can't really access that type of data in a Document.

What you can do is run the analysis chain manually on the text and check each token individually. You can do this in a simple loop, or by adding another stage to the chain (for example, a token filter) that just raises a flag you check later.

This introduces some more work, but it is the best way to achieve that, IMO.
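To make the "simple loop" option concrete, here is a minimal sketch of the check. It does not use the CLucene API at all: `tokenize` is a hypothetical stand-in for whatever token stream your analyzer produces (here it just lowercases and splits on non-alphanumeric characters), and `containsBlockedTerm` and the blocklist are names made up for illustration. In real code you would pull tokens from the analyzer instead, so that the check sees exactly what would be indexed.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the analyzer's token stream: lowercases the text
// and splits it on any non-alphanumeric character.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char ch : text) {
        if (std::isalnum(ch)) {
            current.push_back(static_cast<char>(std::tolower(ch)));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}

// Returns true if any token of `text` appears in `blocklist`.
// Blocklist entries are assumed to be lowercase, matching tokenize().
bool containsBlockedTerm(const std::string& text,
                         const std::unordered_set<std::string>& blocklist) {
    const std::vector<std::string> tokens = tokenize(text);
    return std::any_of(tokens.begin(), tokens.end(),
                       [&blocklist](const std::string& token) {
                           return blocklist.count(token) > 0;
                       });
}
```

You would call `containsBlockedTerm` on each text field before building the Document, and skip the `IndexWriter::addDocument` call when it returns true. Because the match is per token, a blocked word does not trigger on longer words that merely contain it.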
