Using multiple tokenizers in Solr

Question

What I want to be able to do is perform a query and get results back that are not case sensitive and that match partial words from the index.

I have a Solr schema set up at the moment that has been modified so that I can query and return results no matter what case they are. So, if I search for iPOd, Iwill see iPod returned. The code to do this is:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
...
</fieldType>

I have found this code that will allow us to do a partial word match query, but I don't think I can have two tokenizers on one field.

<fieldType name="text" class="solr.TextField" >
  <analyzer type="index">
    <tokenizer class="solr.NGramTokenizerFactory" minGramSize="3" maxGramSize="15" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
...
</fieldType>

So what can I do to perform this tokenizer on the field as well?
Or is there a way to merge them?
Or is there another way I can accomplish this task?

score 10 · Accepted Answer · answered Aug 05 '10 at 06:34

10

Declare another fieldType (i.e. a different name) that has the NGram tokenizer, then declare a field that uses the fieldType with NGram and another field with the standard "text" fieldType. Use copyField to copy one to another. See Indexing same data in multiple fields.

answered Aug 05 '10 at 06:34

Mauricio Scheffer

98,863
23
192
275

1

But how now query that so the results will look in data that was tokenized with both tokenizers? In other words - how to get results from both tokenizers at once? – kul_mi Nov 19 '14 at 10:37

score 8 · Answer 2 · edited Mar 01 '19 at 17:01

An alternative would be to apply the EdgeGramFilterFactory to the existing field and stay with your current tokenizer (WhitespaceTokenizerFactory), e.g.

<filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" />

This would keep your current schema unchanged, i.e. you would not need an additional field which has another tokenizer (NGramTokenizerFactory)

Your field look then something like the below:

   <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="15" />
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
  </analyzer>
...
</fieldType>

Using multiple tokenizers in Solr

2 Answers2