Lucene Indexing to ignore apostrophes

Question

I have a field that might have apostrophes in it. I want to be able to:

1. store the value as is in the index
2. search based on the value, ignoring any apostrophes.

I am thinking of using:

   doc.add(new Field("name", value, Store.YES, Index.NO));
   doc.add(new Field("name", value.replaceAll("['‘’`]",""), Store.NO, Index.ANALYZED));

If I then do the same replace when searching, I guess it should work: the cleaned value is used for indexing and searching, and the value as is for display.
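A minimal sketch of keeping that replace in one place, so the indexing code and the search code cannot drift apart. The class name `ApostropheNormalizer` and the method `strip` are hypothetical names chosen for illustration; only `java.util.regex` is used, and the character class is the one from the snippet above.

```java
import java.util.regex.Pattern;

public class ApostropheNormalizer {
    // Same character class as in the question: ASCII apostrophe,
    // left/right curly single quotes, and backtick.
    private static final Pattern APOSTROPHES = Pattern.compile("['\u2018\u2019`]");

    private ApostropheNormalizer() {}

    /** Returns the input with every apostrophe-like character removed. */
    public static String strip(String text) {
        return APOSTROPHES.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String stored = "O\u2019Brien's";      // value kept as is for display
        String indexed = strip(stored);        // value passed to the analyzed field
        String query = strip("o'briens");      // same call applied to the search text
        System.out.println(indexed);
        System.out.println(query);
    }
}
```

The helper only removes characters. Case folding and tokenization are still left to whatever analyzer is used for the indexed field and for the query.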

Am I missing any other considerations here?

score 0 · Answer 1 · answered Jul 05 '12 at 17:37

Performing replaceAll directly on the value is bad practice in Lucene; it is much better to encapsulate your tokenization recipe in an Analyzer. Also, I don't see the benefit of appending fields in your use case (see Document.add).

If you want to store the original value and yet be able to search without the apostrophes, simply declare your field like this:

doc.add(new Field("name", value, Store.YES, Index.ANALYZED));

Then simply hook up a custom Tokenizer that will strip apostrophes (I think Lucene's StandardAnalyzer already includes this transformation, but that is worth verifying against your Lucene version).
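To illustrate the idea of stripping apostrophes inside the analysis chain rather than on the raw string, here is a rough sketch that uses only `java.io`. It is not Lucene API: Lucene's Analyzer and Tokenizer signatures differ between versions, so the class `ApostropheStrippingReader` and its `filter` helper are hypothetical stand-ins. They show the character-level filtering that would sit in front of a tokenizer, which reads its text from a `Reader`.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class ApostropheStrippingReader extends FilterReader {

    public ApostropheStrippingReader(Reader in) {
        super(in);
    }

    // ASCII apostrophe, left/right curly single quotes, and backtick.
    private static boolean isApostrophe(int c) {
        return c == '\'' || c == '\u2018' || c == '\u2019' || c == '`';
    }

    @Override
    public int read() throws IOException {
        int c;
        do {
            c = super.read();                 // skip apostrophes one char at a time
        } while (c != -1 && isApostrophe(c));
        return c;
    }

    @Override
    public int read(char[] buf, int off, int len) throws IOException {
        if (len == 0) {
            return 0;
        }
        while (true) {
            int n = super.read(buf, off, len);
            if (n <= 0) {
                return n;                     // end of stream (or nothing read)
            }
            int kept = off;
            for (int i = off; i < off + n; i++) {
                if (!isApostrophe(buf[i])) {
                    buf[kept++] = buf[i];     // compact the kept characters in place
                }
            }
            if (kept > off) {
                return kept - off;
            }
            // The whole chunk was apostrophes: read the next chunk.
        }
    }

    /** Convenience helper: runs a string through the filtering reader. */
    static String filter(String text) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[64];
        try (Reader r = new ApostropheStrippingReader(new StringReader(text))) {
            int n;
            while ((n = r.read(buf, 0, buf.length)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(filter("O'Brien and D\u2019Arcy"));
    }
}
```

Because the filtering happens on the character stream, the stored value is untouched. The same analyzer would still have to be used when parsing queries so that both sides see apostrophe-free terms.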

If you are storing the field with the aim of using highlighting you should also consider using Field.TermVector.WITH_POSITIONS_OFFSETS.

Thanks jspboix. Where and how should I hook up the custom Tokenizer? Do I need a Tokenizer or an Analyzer? - epeleg, Jul 05 '12 at 20:44
I am accepting this answer as it is probably the right way to go. As for myself, I ended up with the two `.add` calls as described in the question. - epeleg, Jul 16 '12 at 14:03
