
I am moving an existing index from Lucene to Solr. In Lucene we apply the following logic to the input text:

  1. convert to lower case
  2. replace dictionary words (replace specific words with other words, e.g. replace "hertz" with "htz")
  3. keep only letters and digits
  4. trim the output string
  5. replace \s+ with a single space
  6. split using the java.lang.String#split method
  7. for each resulting word, divide it by the following pattern: "ABCDEF" => ABC BCD CDE DEF (grams of 3 characters, overlapping by 2)
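Taken together, the steps above could be sketched in plain Java roughly like this (the dictionary contents and the sliding-3-gram reading of step 7 are my assumptions, not the original implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class Preprocess {

    // Hypothetical replacement dictionary for step 2
    static final Map<String, String> DICT = Map.of("hertz", "htz");

    public static List<String> analyze(String input) {
        // 1. lower case
        String s = input.toLowerCase(Locale.ROOT);
        // 2. replace dictionary words
        for (Map.Entry<String, String> e : DICT.entrySet()) {
            s = s.replace(e.getKey(), e.getValue());
        }
        // 3. keep letters and digits only (everything else becomes a space)
        s = s.replaceAll("[^\\p{L}\\p{Nd}]", " ");
        // 4. trim, then 5. collapse runs of whitespace to a single space
        s = s.trim().replaceAll("\\s+", " ");
        List<String> out = new ArrayList<>();
        // 6. split on whitespace
        for (String word : s.split(" ")) {
            // 7. sliding 3-grams: "ABCDEF" -> ABC BCD CDE DEF
            if (word.length() <= 3) {
                out.add(word);
            } else {
                for (int i = 0; i + 3 <= word.length(); i++) {
                    out.add(word.substring(i, i + 3));
                }
            }
        }
        return out;
    }
}
```

For example, `analyze("Hertz ABCDEF")` would yield `[htz, abc, bcd, cde, def]` under these assumptions.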

I don't want to write a Tokenizer that might already exist.

So, I looked at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters but got lost.

halfer
Muhammad Hewedy

2 Answers

  1. LowerCaseFilter,
  2. SynonymFilter,
  3. StandardTokenizer or PatternTokenizer,
  4. TrimFilter,
  5. PatternReplaceFilter,
  6. WordDelimiterFilter?
  7. NGramTokenFilter (you may need to write a factory for this one).

But if you already have an existing Lucene analyzer, you can make Solr use it.
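A Solr field type wiring these factories together might look like the following sketch in schema.xml. The field type name, the `synonyms.txt` file, and the exact patterns are my assumptions, and note that Solr requires the tokenizer to come first, so the order differs slightly from the original Lucene steps:

```xml
<!-- Sketch only: name, patterns, and synonyms.txt are assumptions -->
<fieldType name="text_custom" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- split on whitespace (steps 5-6) -->
    <tokenizer class="solr.PatternTokenizerFactory" pattern="\s+"/>
    <!-- step 1 -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- step 2: dictionary replacements, e.g. "hertz => htz" in synonyms.txt -->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true"/>
    <!-- step 3: strip everything except letters and digits -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="[^a-z0-9]" replacement="" replace="all"/>
    <!-- step 4 -->
    <filter class="solr.TrimFilterFactory"/>
    <!-- step 7: 3-character grams -->
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="3"/>
  </analyzer>
</fieldType>
```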

jpountz

Try OpenPipeline. It's designed for preprocessing documents that get fed to search software.

ccleve