Searching for Solr Stop words

Question

On of my solr fields is configured in the following manned,

<fieldType name="text_exact" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
 <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="1" types="wdfftypes.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
 </analyzer>
 <analyzer type="query">
   <tokenizer class="solr.WhitespaceTokenizerFactory"/>
   <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true"/>
   <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" preserveOriginal="1" types="wdfftypes.txt"/>
   <filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>

This works in cases where i don't want stemming, but now there is another use case which is causing a problem, people are beginning to seach for the following combinations,

The Ivy : In this case results with just ivy is being returned, when the expected result would be with The. I understand that this is because of the stop word but is the way to achieve this. For example if they search for "the ivy" within quotes than this should work.
(Mom & Me) OR ("mom and me"): In this case also & is dropped or results including both mom and me in some part of the statement is returned.

I am ok if only new data behaves in the right way but wouldnt be able to reindex. Also, would changing the schema.xml file trigger a full replication?

Regards,
Ayush

score 0 · Answer 1 · answered Jan 06 '13 at 09:13

You are using the white space tokenizer. So "The Ivy" is slitted into 2 words.

You could use an less agressive tokenize an followed by the WordDelimiterFilterFactory in order to activate the protected="protwords.txt" options, where you can set "the ivy" as an protected word so that solr will not tokenize that.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory

Searching for Solr Stop words

1 Answers1