
I've been struggling with Solr and how to deal with compound words for our German site. We mainly deal with clothes and accessories so our search terms are usually words relating to wearable items. I've managed to fine tune the DictionaryCompoundWordTokenFilterFactory so that it splits most of the compound search terms that we may encounter (for example: schwarzkleid => schwarz kleid).

However, the search is returning irrelevant results: it returns items that include only the word "schwarz" as well as items that include only the word "kleid". So instead of only seeing black dresses (schwarzkleid = black dress), I am seeing dresses of different colors and also other items that are black.

Essentially Solr is performing an OR on the split tokens and returning any item that contains either keyword.

My complete query is q=keywords:schwarzkleid AND deleted:0 (where 0 indicates that the product has not sold out yet). The debug output for this query is:

"debug": {
"rawquerystring": "keywords:schwarzkleid AND deleted:0",
"querystring": "keywords:schwarzkleid AND deleted:0",
"parsedquery": "+((keywords:schwarzkleid keywords:schwarz keywords:kleid)/no_coord) +deleted:0",
"parsedquery_toString": "+(keywords:schwarzkleid keywords:schwarz keywords:kleid) +deleted:`\b\u0000\u0000\u0000\u0000",

This returns 24000+ results, whereas searching directly for keywords:schwarz AND keywords:kleid returns ~10000 results, which is what I want. I am using Solr 4.7 and the Solr PHP library to interact with it through my web application.

Any ideas on how to fine-tune my query to get only the relevant results?

Here is the fieldType in question:

<!-- German -->
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="org.apache.lucene.analysis.de.compounds.GermanCompoundSplitterTokenFilterFactory" compileDict="true" dataDir="/home/ali/Downloads/solr-4.7.0/example/solr/findemode-dev/conf/wordlist/"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/>
    <filter class="solr.GermanNormalizationFilterFactory"/>
    <filter class="org.apache.lucene.analysis.de.compounds.GermanCompoundSplitterTokenFilterFactory" compileDict="false" dataDir="/home/ali/Downloads/solr-4.7.0/example/solr/findemode-dev/conf/wordlist/"/>
    <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
  </analyzer>
</fieldType>
halfer
Alistair
  • Would you share the field type from your schema.xml that deals with those nice dresses? – cheffe Apr 23 '14 at 07:58
  • Could you please add the whole fieldType in your question? You cannot post that much code in a comment and you should not. This is what the `edit` under your question is good for :) – cheffe Apr 24 '14 at 15:17
  • Sorry, edited my question to reflect the full fieldType. – Alistair Apr 28 '14 at 07:45
  • A user [asking this question](https://stackoverflow.com/questions/28415766/magento-and-solr-3-6) has asked below: "how did you manage to include the "org.apache.lucene.analysis.de.compounds.GermanCompoundSplitterTokenFilterFactory" into your schema.xml"? That will be deleted by a mod shortly, so if you can help on their question, I am sure they would appreciate it! – halfer Feb 09 '15 at 18:42
  • Wouldn't it be more natural to do the compound word splitting only during indexing but not during querying? A word like "Schwarzkleid" would then be indexed as "Schwarzkleid", "schwarz" and "Kleid", so you can find the document(s) with any of the three words. However, when searching for "Schwarzkleid" you would only find documents containing that word, not documents which contain "schwarz" or "kleid" for other reasons. – thomas.schuerger Aug 11 '21 at 07:36
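
The index-only splitting suggested in the last comment could be sketched roughly as follows. This is only an illustration, not a tested configuration: it reuses the query analyzer of the text_de fieldType from the question and simply leaves out the compound splitter, while the index analyzer stays as posted. Note that, as the comment itself says, a search for "schwarzkleid" would then only match documents whose indexed tokens include that compound, so it would not find products where "schwarz" and "kleid" appear as separate words.

```xml
<!-- Sketch only: query-side analyzer for text_de WITHOUT compound splitting. -->
<!-- The index-side analyzer keeps the compound splitter as in the question. -->
<analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" format="snowball" enablePositionIncrements="true"/>
  <filter class="solr.GermanNormalizationFilterFactory"/>
  <!-- GermanCompoundSplitterTokenFilterFactory intentionally omitted here -->
  <filter class="solr.SnowballPorterFilterFactory" language="German2"/>
</analyzer>
```

Changing an analyzer chain generally requires reindexing or at least re-testing, since index-time and query-time tokens have to line up for matches to occur.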

1 Answer


I've managed to solve this (in a rather hacky way) by using filter queries and the edismax query parser.

I added in my solrconfig.xml the following parameters:

<str name="defType">edismax</str>
<str name="mm">75%</str>
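
For context, parameters like these usually live in the defaults list of a search request handler. A minimal sketch of where they might sit, assuming the standard /select handler (the handler name is an assumption; the answer does not say which handler was edited):

```xml
<!-- Sketch only: the handler name "/select" is assumed, not taken from the answer. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Use the Extended DisMax parser unless a request overrides defType -->
    <str name="defType">edismax</str>
    <!-- Minimum-should-match: share of optional clauses that must match -->
    <str name="mm">75%</str>
  </lst>
</requestHandler>
```

Values in defaults can still be overridden per request, which is why the filter queries below can pick their own parser with {!edismax}.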

Then, when searching for multiple keywords (for example: schwarzkleid wenz, where wenz is a German brand name), I use the first keyword as the query and add anything after that as a filter query. So my final query looks something like this:

fl=id&sort=popular+desc&indent=on&q=keywords:'schwarzkleide'+&wt=json&fq={!edismax}+keywords:'wenz'&fq=deleted:0

My compound splitter filter splits schwarzkleide correctly, and the query is parsed as edismax with mm=75%. The filter queries are then added; the one on keywords is also parsed as edismax. The returned result is all the black dresses from 'Wenz'.
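
Since fq=deleted:0 is sent with every request, one option might be to append it in solrconfig.xml instead of in the application code. A minimal sketch, assuming the same search handler that holds defType and mm; the handler name /select is an assumption and this was not part of the original setup:

```xml
<!-- Sketch only: handler name "/select" is assumed; defaults (defType, mm) omitted for brevity. -->
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="appends">
    <!-- Added to every request on this handler, in addition to any fq the client sends -->
    <str name="fq">deleted:0</str>
  </lst>
</requestHandler>
```

The brand filter (keywords:'wenz' in the example) changes per search, so it would still be passed by the client as a regular fq parameter.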

If anybody has a better solution to what I've posted I would be more than happy to read up on it as I'm quite new to Solr and I think my way is a bit convoluted to be honest.

Thanks.

Alistair
  • I'm accepting my own answer for now as it currently solves my problem, but if someone else comes up with a better answer (since I am convinced my method is unnecessarily complicated), I will accept that one. – Alistair Apr 24 '14 at 13:21