I am moving an existing index from Lucene to Solr. In Lucene, we apply the following logic to the input text:
- convert to lower case
- replaceDictionaryWords (replace some specific words with other words, for example replace "hertz" with "htz")
- keep only letters and digits
- trim the output string
- replace \s+ with a single space
- split using the java.lang.String#split(in) method
- for each resulting token, split the word into overlapping pieces following this pattern: "ABCDEF" => ABC BCD CDE DEF (divide on 3, 2)
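
For reference, the steps above amount to roughly the following plain-Java sketch. It is only an illustration, not the actual Lucene code: the class and method names are hypothetical, the dictionary holds just the one pair mentioned above, and it assumes that whitespace survives the "letters and digits" step, that "divide on 3, 2" means a window of 3 characters with consecutive pieces overlapping by 2, and that words of 3 characters or fewer are emitted unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class NgramSketch {

    // Hypothetical dictionary; only the "hertz" -> "htz" pair comes from the description above.
    private static final Map<String, String> DICTIONARY = Map.of("hertz", "htz");

    // Assumed window size: 3 characters, sliding one character at a time.
    private static final int GRAM_SIZE = 3;

    static List<String> analyze(String input) {
        // 1. lower case
        String text = input.toLowerCase(Locale.ROOT);

        // 2. dictionary replacement on whole words (assumption: word boundaries matter)
        for (Map.Entry<String, String> e : DICTIONARY.entrySet()) {
            text = text.replaceAll(
                    "\\b" + Pattern.quote(e.getKey()) + "\\b",
                    Matcher.quoteReplacement(e.getValue()));
        }

        // 3. drop everything except letters, digits and whitespace
        text = text.replaceAll("[^\\p{L}\\p{Nd}\\s]", "");

        // 4. + 5. trim, then collapse whitespace runs into a single space
        text = text.trim().replaceAll("\\s+", " ");

        // 6. + 7. split into words, then cut each word into overlapping pieces
        List<String> grams = new ArrayList<>();
        if (text.isEmpty()) {
            return grams;
        }
        for (String word : text.split(" ")) {
            grams.addAll(slidingGrams(word));
        }
        return grams;
    }

    static List<String> slidingGrams(String word) {
        List<String> out = new ArrayList<>();
        if (word.length() <= GRAM_SIZE) {
            out.add(word); // assumption: short words are kept as they are
            return out;
        }
        for (int i = 0; i + GRAM_SIZE <= word.length(); i++) {
            out.add(word.substring(i, i + GRAM_SIZE));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(analyze("ABCDEF")); // prints [abc, bcd, cde, def]
    }
}
```

The goal is to get the same token stream out of a Solr analyzer chain, ideally using only stock components.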
I don't want to write a Tokenizer that might already exist.
So I looked at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters but got lost.