I am building a search index with CLucene, and I want to make sure that documents containing any offensive terms never get added to the index. Using a StandardAnalyzer with a stop list is not good enough: the offensive document still gets added and would be returned for non-offensive searches.

Instead, I am hoping to build up a document, check whether it contains any offensive words, and add it to the index only if it doesn't.

Cheers!

duffy

1 Answer

You can't really access that kind of data through a Document.

What you can do is run the analysis chain manually on the text and check each token individually. You can do this in a simple loop, or by adding another analyzer to the chain that only raises a flag, which you then check before adding the document.

This introduces some more work, but it is the best way to achieve that, IMO.
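As a rough illustration of the simple-loop variant, here is a minimal sketch in plain C++. It deliberately does not call CLucene: `tokenize` is a hypothetical stand-in for pulling tokens from your analyzer's token stream, and `addIfClean` with its `std::vector` "index" is a hypothetical stand-in for adding the document to the real index writer. Only the control flow (analyze first, check every token, add only if nothing matched) is the point.

```cpp
#include <cctype>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for the analysis chain: lowercases the text and
// splits it on any non-alphanumeric character. In a real setup the tokens
// would come from the same analyzer that the index uses.
std::vector<std::string> tokenize(const std::string& text) {
    std::vector<std::string> tokens;
    std::string current;
    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            current.push_back(static_cast<char>(std::tolower(c)));
        } else if (!current.empty()) {
            tokens.push_back(current);
            current.clear();
        }
    }
    if (!current.empty()) {
        tokens.push_back(current);
    }
    return tokens;
}

// Returns true if any token of the analyzed text is in the blocked set.
bool containsBlockedTerm(const std::string& text,
                         const std::unordered_set<std::string>& blocked) {
    for (const std::string& token : tokenize(text)) {
        if (blocked.count(token) > 0) {
            return true;
        }
    }
    return false;
}

// Hypothetical stand-in for indexing: the text is stored only when the
// check passes. Returns whether the document was added.
bool addIfClean(std::vector<std::string>& index,
                const std::string& text,
                const std::unordered_set<std::string>& blocked) {
    if (containsBlockedTerm(text, blocked)) {
        return false;  // rejected, never reaches the index
    }
    index.push_back(text);
    return true;
}
```

The blocked terms should be written in the same form the analyzer emits (lowercased here), otherwise a term could slip past the check while still being searchable.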

synhershko