I've been struggling with how to handle compound words in Solr for our German site. We mainly sell clothes and accessories, so our search terms are usually words for wearable items. I've managed to fine-tune the DictionaryCompoundWordTokenFilterFactory so that it splits most of the compound search terms we are likely to encounter (for example: schwarzkleid => schwarz kleid).
However, the search returns irrelevant results: items that contain only the word "schwarz" as well as items that contain only the word "kleid". So instead of seeing only black dresses (schwarzkleid = black dress), I am seeing dresses of every color and also other items that happen to be black.
Essentially Solr is performing an OR on the split tokens and returning any item that contains either keyword.
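To make the difference concrete, here is a toy Python sketch of the two behaviours. The three documents and the function names are made up for illustration and say nothing about Solr's internals; the sketch only shows why matching any split token returns more items than matching all of them.

```python
# Toy model: each document is the set of tokens in its keywords field.
# The documents are invented for illustration, not real catalogue data.
DOCS = {
    1: {"schwarz", "kleid"},   # black dress
    2: {"rot", "kleid"},       # red dress
    3: {"schwarz", "schuh"},   # black shoe
}


def match_any(tokens, docs=DOCS):
    """IDs of documents containing at least one token (OR semantics)."""
    wanted = set(tokens)
    return sorted(doc_id for doc_id, words in docs.items() if words & wanted)


def match_all(tokens, docs=DOCS):
    """IDs of documents containing every token (AND semantics)."""
    wanted = set(tokens)
    return sorted(doc_id for doc_id, words in docs.items() if wanted <= words)


split_tokens = ["schwarz", "kleid"]
print(match_any(split_tokens))  # what I am getting: [1, 2, 3]
print(match_all(split_tokens))  # what I want: [1]
```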
My complete query is q=keywords:schwarzkleid AND deleted:0 (where deleted:0 means the product has not sold out yet). The debug output for this query is:
"debug": {
"rawquerystring": "keywords:schwarzkleid AND deleted:0",
"querystring": "keywords:schwarzkleid AND deleted:0",
"parsedquery": "+((keywords:schwarzkleid keywords:schwarz keywords:kleid)/no_coord) +deleted:0",
"parsedquery_toString": "+(keywords:schwarzkleid keywords:schwarz keywords:kleid) +deleted:`\b\u0000\u0000\u0000\u0000",
This returns a total of 24000+ results, whereas if I search directly for keywords:schwarz AND keywords:kleid I get ~10000 results, which is what I want. I am using Solr 4.7 and the Solr PHP library to interact with it from my web application.
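For reference, the explicit query that gives the right results could be assembled client-side along these lines. This is only a sketch in Python (my application uses PHP), the helper name build_and_query is hypothetical, the split into "schwarz" and "kleid" is hardcoded rather than taken from the dictionary, and nothing is escaped.

```python
def build_and_query(field, tokens, extra_clauses=()):
    """Join field:token clauses (plus any extra clauses) with AND.

    Hypothetical helper: it does not escape Solr special characters
    and assumes the compound has already been split into tokens.
    """
    clauses = [f"{field}:{token}" for token in tokens]
    clauses.extend(extra_clauses)
    return " AND ".join(clauses)


print(build_and_query("keywords", ["schwarz", "kleid"], ["deleted:0"]))
# keywords:schwarz AND keywords:kleid AND deleted:0
```

I would rather not duplicate the splitting logic outside Solr like this, which is why I am looking for a way to get the same behaviour from the query itself.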
Any ideas on how to fine-tune my query to get only the relevant results?
Here is the fieldType in question:
<!-- German -->
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="org.apache.lucene.analysis.de.compounds.GermanCompoundSplitterTokenFilterFactory" compileDict="true" dataDir="/home/ali/Downloads/solr-4.7.0/example/solr/findemode-dev/conf/wordlist/"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="org.apache.lucene.analysis.de.compounds.GermanCompoundSplitterTokenFilterFactory" compileDict="false" dataDir="/home/ali/Downloads/solr-4.7.0/example/solr/findemode-dev/conf/wordlist/"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>
</fieldType>