In our Solr-based search, we started out using phrase queries. For example, when the user types
blue dress
then the Solr query will be
title:"blue dress" OR description:"blue dress"
We now want to remove stop words. Using the default StopFilterFactory, the query
the blue dress
will match documents containing "blue dress" or "the blue dress".
However, when typing
blue the dress
then it does not match documents containing "blue dress".
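One possible explanation, offered as a guess rather than a confirmed diagnosis: the stop filter removes "the" but leaves a gap at its position, so the phrase query expects "dress" two positions after "blue". The sketch below is a simplified model of that behaviour in plain Python. It does not call Solr; `STOP_WORDS`, `analyze` and `phrase_matches` are hypothetical names, and edge n-grams, stemming and slop are ignored.

```python
# Simplified model of phrase matching with position gaps left by stop words.
# Assumption: STOP_WORDS stands in for the contents of stopwords.txt.
STOP_WORDS = {"the", "a", "an"}


def analyze(text):
    """Return (term, position) pairs; stop words are dropped but keep their slot."""
    return [
        (token, position)
        for position, token in enumerate(text.lower().split())
        if token not in STOP_WORDS
    ]


def phrase_matches(query, document):
    """True if the query terms occur in the document at the same relative positions."""
    query_terms = analyze(query)
    doc_terms = analyze(document)
    if not query_terms:
        return False
    doc_lookup = set(doc_terms)
    first_term, first_position = query_terms[0]
    for term, position in doc_terms:
        if term != first_term:
            continue
        # Align the first query term with this occurrence, then check the rest.
        shift = position - first_position
        if all((t, p + shift) in doc_lookup for t, p in query_terms):
            return True
    return False


print(phrase_matches("the blue dress", "blue dress"))  # leading gap is harmless
print(phrase_matches("blue the dress", "blue dress"))  # gap between the two terms
```

In this model the first query matches because the gap sits before both terms, while the second fails because the gap sits between them, which is the same pattern as described above.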
I am starting to wonder whether we should instead search using single terms only. That is, convert the above user search into
title:the OR title:blue OR title:dress OR description:the OR description:blue OR description:dress
I am a bit reluctant to do this, though, as it seems to duplicate the work of the StandardTokenizerFactory.
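To make the two options concrete, here is a minimal sketch of how the client side could build each query string. It is illustrative only: `phrase_query` and `term_query` are hypothetical helper names, the whitespace split is a naive stand-in for real tokenisation, and special query characters in single terms are not escaped.

```python
# Hypothetical client-side helpers that build the two query shapes discussed above.
FIELDS = ("title", "description")


def phrase_query(user_input, fields=FIELDS):
    """Build one quoted phrase clause per field, joined with OR."""
    # Escape backslashes and double quotes so the phrase stays well-formed.
    escaped = user_input.replace("\\", "\\\\").replace('"', '\\"')
    return " OR ".join(f'{field}:"{escaped}"' for field in fields)


def term_query(user_input, fields=FIELDS):
    """Build one clause per (field, term) pair, joined with OR."""
    # Assumption: a plain whitespace split; Solr's own tokenizer is more thorough.
    terms = user_input.split()
    return " OR ".join(f"{field}:{term}" for field in fields for term in terms)


print(phrase_query("blue dress"))
print(term_query("the blue dress"))
```

The second helper is where the duplication shows up: the client splits the input before Solr's analyzer chain ever sees it.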
Here is my schema.xml:
<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English" />
  </analyzer>
</fieldType>
The title and the description fields are both of type text_general.
Is single-term search the standard way of searching in Solr? Are we exposing ourselves to problems (performance issues, maybe) by tokenising the words before calling Solr? Or is thinking in terms of single terms vs. phrases simply the wrong approach, and should we leave it to the user to decide?