Solr Dismax handler - whitespace and special character behaviour

Question

I get strange results when my query contains special characters.

Here is my request:

q=histoire-france&start=0&rows=10&sort=score+desc&defType=dismax&qf=any^1.0&mm=100%

Parsed query:

<str name="parsedquery_toString">+((any:histoir any:franc)) ()</str>

I get 17000 results because Solr is doing an OR (it should be an AND).

I have no problem when I use whitespace instead of a special character:

q=histoire france&start=0&rows=10&sort=score+desc&defType=dismax&qf=any^1.0&mm=100%

<str name="parsedquery_toString">+(((any:histoir) (any:franc))~2) ()</str>

2000 results for this query.

Here is my schema.xml (relevant parts):

<fieldType name="text" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="false">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.CommonGramsFilterFactory" words="stopwords_french.txt" ignoreCase="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_french.txt" enablePositionIncrements="true"/>
        <filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!--<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>-->
        <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" preserveOriginal="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.CommonGramsFilterFactory" words="stopwords_french.txt" ignoreCase="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords_french.txt" enablePositionIncrements="true"/>
        <filter class="solr.SnowballPorterFilterFactory" language="French" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
        <filter class="solr.ASCIIFoldingFilterFactory"/>
      </analyzer>
    </fieldType>

I even tried a PatternTokenizerFactory to tokenize on whitespace and special characters, but nothing changed.

My current workaround is to replace all special characters with whitespace before sending the query to Solr, but that is not satisfactory.

EDIT: Even with a charFilter (PatternReplaceCharFilterFactory) that replaces special characters with whitespace, it doesn't work.

First line of the analysis in the Solr admin, with verbose output, for the query 'histoire-france':

org.apache.solr.analysis.PatternReplaceCharFilterFactory {replacement= , pattern=([,;./\\'&-]), luceneMatchVersion=LUCENE_32}
text    histoire france

The '-' is replaced by ' ', and the result is then tokenized by WhitespaceTokenizerFactory. However, I still get a different number of results for 'histoire-france' and 'histoire france'.
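
For reference, a charFilter like the one in the analysis output above would be declared roughly as follows in the query analyzer. This is a sketch rebuilt from the factory name and the pattern/replacement values shown in that output. The rest of the chain is abbreviated, and the `&` is written as `&amp;` because it sits inside an XML attribute.

```xml
<analyzer type="query">
  <!-- Replace the listed punctuation with a space before tokenizing.
       Pattern and replacement are taken from the analysis output above. -->
  <charFilter class="solr.PatternReplaceCharFilterFactory"
              pattern="([,;./\\'&amp;-])" replacement=" "/>
  <tokenizer class="solr.WhitespaceTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
  <!-- remaining filters as in the schema above -->
</analyzer>
```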

Did I miss something?

Did you reindex the data after changing from WhitespaceTokenizer to PatternTokenizer? You need to reindex the data in order to see any changes. — Dorin, Oct 25 '11 at 10:48
You're saying you have can you change it to restart SOLR and share the number of results for each query. If my guess is right, I will give you a more detailed explanation later. — Dorin, Oct 25 '11 at 11:01
I changed defaultOperator and restarted Solr. No change. Anyway, I think the Dismax handler uses the "mm" (minimum match) parameter instead of the default operator. Here I have mm=100%, which is the same as having defaultOperator="AND" for the default handler. — Romain Meresse, Oct 25 '11 at 11:09
If I use mm=0% (defaultOperator="OR"), I get 17000 results for each query. — Romain Meresse, Oct 25 '11 at 11:14
I think DISMAX doesn't care about defaultOperator when building the query, and it sees "histoire-france" as a single word, and "histoire france" as 2 separate words. Sorry I couldn't help more. — Dorin, Oct 25 '11 at 11:31

Grimmo · Answer 1 · 2012-02-07T00:53:58.843

You get a different number of results for 'histoire-france' and 'histoire france' because the query parser creates a phrase query in the first case and a boolean query (two separate words) in the second.

This is not obvious behavior IMHO, but I believe it's hard to satisfy all use cases.

To make the search treat 'histoire-france' as simply two words, you can add "solr.PositionFilterFactory" to the end of the query analyzer, like this:

  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PositionFilterFactory" />
  </analyzer>

Then search results for 'histoire-france' and 'histoire france' will be equal.

Note that the position filter can be undesirable for phrase searches (where both 'histoire' and 'france' must be present in order). If you have modified the term sequence with, say, an NGram filter, consider using the query slop parameter with qs > 0 instead.

score 1 · Accepted Answer · answered Jan 24 '13 at 09:38

It was a bug: https://issues.apache.org/jira/browse/SOLR-3589

With edismax mm set to 100%, if one of the tokens is split into two tokens by the analyzer chain (i.e. "fire-fly" => fire fly), the mm parameter is ignored and the equivalent of OR query for "fire OR fly" is produced. This is particularly a problem for languages that do not use white space to separate words such as Chinese or Japanese.

It is fixed in Solr 4.1 (22 January 2013)

score 1 · Answer 3 · answered Oct 25 '11 at 15:02

Using WhitespaceTokenizerFactory, Solr will split your query string into words.

But after tokenizing, Solr splits each word again into terms using solr.WordDelimiterFilterFactory. Look at the documentation, in particular the Wi-Fi example.

That could be one reason why 'histoire france' and 'histoire-france' are handled differently.

Second, don't forget that DisMax (normally) handles the query term both as a "term" and, additionally, as a parsed string.

To solve your problem, you could try dropping the word delimiter and doing the tokenizing with PatternTokenizerFactory (as you tried before, but this time without WordDelimiterFilterFactory).
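
A minimal sketch of what that suggestion could look like. The field type name `text_pattern` and the exact character class are assumptions for illustration, and the filter chain is cut down to lowercasing only. The asker reported that PatternTokenizerFactory made no difference in their setup, so treat this as something to test rather than a confirmed fix.

```xml
<!-- Hypothetical field type: the name and the pattern are illustrative assumptions. -->
<fieldType name="text_pattern" class="solr.TextField" positionIncrementGap="100">
  <!-- A single analyzer, used for both indexing and querying. -->
  <analyzer>
    <!-- group="-1" makes the pattern act as a delimiter (split mode):
         tokens are the text between runs of whitespace or listed punctuation. -->
    <tokenizer class="solr.PatternTokenizerFactory"
               pattern="[\s,;./\\'&amp;-]+" group="-1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Because this changes index-time analysis as well, the data has to be reindexed before comparing result counts.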

If that doesn't work, try posting the complete output of analysis.jsp.

Jayendra · Answer 4 · 2011-10-26T13:17:57.940

Set autoGeneratePhraseQueries to true so that phrase queries are generated.
Then a search for histoire-france generates a quoted (phrase) query, which matches only documents that contain both words as a phrase.

<str name="parsedquery">(+DisjunctionMaxQuery(((any:histoire any:franc))))/no_coord</str>

Example working configuration:

<fieldType name="text_test" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Use the query slop parameter to set how much slop a phrase query allows, e.g. qs=10.

<str name="parsedquery">(+DisjunctionMaxQuery((any:"histoire france"~10)))/no_coord</str>
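
If you would rather not pass qs on every request, it can be set as a default on the request handler in solrconfig.xml. The sketch below assumes a handler named /select and reuses the parameters from the question; adjust the names to your own setup.

```xml
<requestHandler name="/select" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="qf">any^1.0</str>
    <str name="mm">100%</str>
    <!-- Query phrase slop: slop applied to phrase queries on the qf fields. -->
    <int name="qs">10</int>
  </lst>
</requestHandler>
```

A qs value sent with a request still overrides this default.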

If I add autoGeneratePhraseQueries, a phrase query is generated for "france-histoire" but not for "france histoire". Suppose I have a document containing "histoire de la france". Then the phrase "france-histoire" will not match... — Romain Meresse, Oct 26 '11 at 08:20
