Apache Solr Tokenizers

Question

I am using Apache Solr as my semantic search engine. Users can type anything, and I have to retrieve relevant results based on the words in their input.

I want to split a string into tokens.

Example: "actorsfrommumbai" -> "actors from mumbai"

How can I achieve this in Solr?

Possible duplicate of [How to token a word which combined by two words without whitespace](http://stackoverflow.com/questions/25153480/how-to-token-a-word-which-combined-by-two-words-without-whitespace) — MatsLindh, Aug 08 '16 at 11:10
Thanks for the reply, but that is a tokenizer that receives a field as input while loading data into Solr. What should I do when searching for **actorsinmumbai**? How can I split the string when a user searches for **actorsinmumbai**? This has to happen at query time — Mayur Champaneria, Aug 08 '16 at 11:19
Have you _actually_ tried the method suggested? The filter will break the tokens into more tokens, one for each part of the word. You can give different sequences of filters for indexing and querying by using the 'index' and 'query' parameters to the analysis chain definition. — MatsLindh, Aug 08 '16 at 12:29

score 1 · Answer 1 · answered Dec 08 '21 at 16:32

You can try the NGram and EdgeNGram filters and tokenizers available in Solr. Because the input is a single word with no delimiter to split on, n-gram analysis is one way to break it into smaller pieces.
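A minimal sketch of what such a field type could look like, assuming you index ordinary whitespace-separated text and only need to break up the query string. The field type name `text_ngram` and the gram sizes 3 to 8 are illustrative assumptions, not values from the question; tune them for your data.

```xml
<!-- Hypothetical field type: name and gram sizes are assumptions. -->
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <!-- Index time: keep whole words such as "actors" and "mumbai". -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Query time: emit every substring of 3 to 8 characters,
       so "actorsfrommumbai" also yields "actors", "from" and "mumbai". -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="8"/>
  </analyzer>
</fieldType>
```

The trade-off is noise: the query also produces grams such as "act" or "tors" that can match unrelated indexed words, so relevance may suffer compared with a dictionary-based split.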

Dimanshu Parihar

score 0 · Answer 2 · answered Aug 08 '16 at 12:13

It looks like you are looking for decompounding: https://wiki.apache.org/solr/LanguageAnalysis#Decompounding. It lets you search for the individual parts of compound words.

The Bndr

score 0 · Answer 3 · answered Sep 20 '16 at 05:50

Solr can be configured with an analyzer that decompounds words based on a dictionary you provide. The analyzer configuration looks something like this:

    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
              dictionary="abc.txt"/>
    </analyzer>

abc.txt is the dictionary.

Note that an analyzer declared this way applies at both index time and query time.
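If you only want the split to happen when a user searches, you can declare separate index and query analyzers with the `type` attribute. The following is a sketch under assumptions: the field type name `text_decompound` and the size settings are illustrative, and `abc.txt` is assumed to be a plain-text file in the collection's config directory with one word per line (for example `actors`, `from`, `mumbai`).

```xml
<!-- Hypothetical field type: name and size settings are assumptions. -->
<fieldType name="text_decompound" class="solr.TextField" positionIncrementGap="100">
  <!-- Index time: documents are tokenized normally, no decompounding. -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Query time: split run-together input using the dictionary. -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.DictionaryCompoundWordTokenFilterFactory"
            dictionary="abc.txt"
            minWordSize="5"
            minSubwordSize="2"
            maxSubwordSize="15"
            onlyLongestMatch="true"/>
  </analyzer>
</fieldType>
```

With this setup a query for "actorsfrommumbai" should keep the original token and add the dictionary words found inside it. You can check the resulting tokens on the Analysis screen of the Solr admin UI before relying on it.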
