How to configure stemming in Solr?

Question

I added "American" to my Solr index. When I search for "America", there are no results.

How should schema.xml be configured to get results?

Current configuration:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
        <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
        <filter class="solr.PorterStemFilterFactory"/>
    </analyzer>
</fieldType>

Marko Bonaci · Answer 1 · 2011-03-12T22:43:34.483

4

Why do you have two stemmers?
Try removing EnglishPorterFilterFactory (deprecated) from both of your analyzer types, rebuild the index, and then check whether a search for America returns American.

If that doesn't work, the other thing you can try is to remove both of your stemmer filters and add SnowballPorterFilterFactory with language="English" instead.
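As a rough sketch of that second suggestion, the field type from the question could look like the following, with both original stemmers replaced by a single SnowballPorterFilterFactory. Reusing protwords.txt through the `protected` attribute and moving RemoveDuplicatesTokenFilterFactory to the end of the chain are my assumptions, not something the question specified. Whether this makes America match American still depends on what the stemmer produces for each word.

```xml
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" />
        <filter class="solr.LowerCaseFilterFactory" />
        <!-- Single stemmer; replaces EnglishPorterFilterFactory and PorterStemFilterFactory -->
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt" />
        <!-- Assumption: run duplicate removal after stemming -->
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory" />
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" />
        <filter class="solr.LowerCaseFilterFactory" />
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt" />
        <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
    </analyzer>
</fieldType>
```

The index has to be rebuilt after changing the index-time analyzer, since previously indexed tokens keep their old stems.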

edited Mar 12 '11 at 22:43

answered Mar 12 '11 at 22:38

Marko Bonaci


Tried both approaches. The same. – user657009 Mar 12 '11 at 22:51
Index: "Slots". There are results when I search for "Slots", "Slot", "Slotting". Index: "American". No results for "American". – user657009 Mar 12 '11 at 22:55

The first thing you should do is open your Solr admin web app, go to Analysis, and choose your field type/name (check both verbose output options). Type American in the Index field and America in the Query field. This lets you see exactly how each value gets analyzed, filter by filter. For more detailed analysis, download [Luke](http://www.getopt.org/luke/luke-0.9.9/lukeall-0.9.9.jar) if you don't have it already (it's the executable jar). Start it and load the Lucene index. Use it to find out exactly how your content got stemmed, along with a lot of other useful information... – Marko Bonaci Mar 12 '11 at 23:06
You re-indexed the content between tries, right? OK, now you can (using Admin > Analysis) see exactly what effect each of those stemmers has on the word 'American'. – Marko Bonaci Mar 12 '11 at 23:12

According to http://snowball.tartarus.org/demo.php, **American**, when stemmed, is **left intact**. – Marko Bonaci Mar 12 '11 at 23:21
strange...I added: – user657009 Mar 12 '11 at 23:28
What does DoubleMetaphone return for American, and what for America? – Marko Bonaci Mar 13 '11 at 01:07

score 2 · Answer 2 · answered Apr 05 '17 at 06:57

Use only one stemmer per analyzer. EnglishPorterFilterFactory is deprecated, as @Marko already mentioned, so you should remove it from both analyzers.

I used SnowballPorterFilterFactory for both the index and query analyzers:

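<fieldType name="text_stem" class="solr.TextField">
    <!-- A single analyzer element applies to both indexing and querying -->
    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English"/>
        <!-- other filters -->
    </analyzer>
</fieldType>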

The fieldType definition is fairly self-explanatory, but just in case:

Tokenizer solr.WhitespaceTokenizerFactory: This breaks the text into words (tokens), using whitespace as the delimiter.
Filter solr.SnowballPorterFilterFactory: This filter applies a stemming algorithm to each word (token). In the example above I have chosen the Snowball Porter stemming algorithm. Solr provides implementations of several popular stemming algorithms.

You can also look at other stemming filters, e.g. HunspellStemFilterFactory and KStemFilterFactory.
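To illustrate swapping in one of those alternatives, here is a minimal sketch using KStemFilterFactory. The field type name text_kstem is hypothetical, and the LowerCaseFilterFactory in front of the stemmer is my addition, on the assumption that the stemmer should see lowercased tokens. Check the result for your own terms on the Analysis page before relying on it.

```xml
<!-- Hypothetical field type name; adjust to your schema -->
<fieldType name="text_kstem" class="solr.TextField">
    <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- Assumption: lowercase tokens before stemming -->
        <filter class="solr.LowerCaseFilterFactory"/>
        <!-- Still only one stemmer in the chain -->
        <filter class="solr.KStemFilterFactory"/>
    </analyzer>
</fieldType>
```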
