
I have a word like lovelive, which is formed by combining the two simple words love and live without whitespace.

I want to know which kind of Lucene analyzer can tokenize this kind of word into two separate words.

Cao Dongping

1 Answer


Have a look at the DictionaryCompoundWordTokenFilter as described in the Solr Reference Guide:

This filter splits, or decompounds, compound words into individual words using a dictionary of the component words. Each input token is passed through unchanged. If it can also be decompounded into subwords, each subword is also added to the stream at the same logical position.

In: "Donaudampfschiff dummkopf"

Tokenizer to Filter: "Donaudampfschiff"(1), "dummkopf"(2)

Out: "Donaudampfschiff"(1), "Donau"(1), "dampf"(1), "schiff"(1), "dummkopf"(2), "dumm"(2), "kopf"(2)

As you can see in the sample configuration, you will need a dictionary for the language you want to split. The sample uses a germanwords.txt file that lists the component words the compounds should be broken into. In your case the dictionary would contain love and live (see the sketch after the configuration below).

<analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="germanwords.txt"/>
</analyzer>
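
For the lovelive case from the question, a minimal sketch of the dictionary file could look like the following. The file name words.txt is hypothetical; the file just lists the component words, one per line, and is referenced from the filter's dictionary attribute:

love
live

<filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="words.txt"/>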

For Lucene itself the class is org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter. The source code can be found on GitHub.
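
If you use Lucene directly rather than Solr, a minimal sketch could look like this. It assumes a reasonably recent Lucene (roughly 5.x or later, where CharArraySet lives in org.apache.lucene.analysis); in older 4.x versions the constructors additionally take a Version argument. The class name CompoundSplitDemo is mine, not from the library:

import java.io.StringReader;
import java.util.Arrays;

import org.apache.lucene.analysis.CharArraySet;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.compound.DictionaryCompoundWordTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class CompoundSplitDemo {
    public static void main(String[] args) throws Exception {
        // Dictionary of component words the filter may split compounds into.
        CharArraySet dictionary = new CharArraySet(Arrays.asList("love", "live"), true);

        Tokenizer tokenizer = new StandardTokenizer();
        tokenizer.setReader(new StringReader("lovelive"));

        // Wrap the tokenizer: the compound token passes through unchanged and
        // the subwords are added at the same position, as described above.
        TokenStream stream = new DictionaryCompoundWordTokenFilter(tokenizer, dictionary);
        CharTermAttribute term = stream.addAttribute(CharTermAttribute.class);

        stream.reset();
        while (stream.incrementToken()) {
            System.out.println(term.toString()); // prints: lovelive, love, live
        }
        stream.end();
        stream.close();
    }
}

Note that by default only tokens of at least 5 characters are considered for decompounding and subwords must be between 2 and 15 characters long; these limits can be tuned with the longer constructor (minWordSize, minSubwordSize, maxSubwordSize, onlyLongestMatch).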

cheffe