Solr dismax behaviour - punctuation and white space splitting

Question

I have a Solr 4.7.0 instance with 200,000 documents in the index (one document per file on a filesystem), used by several users. Documents are identified by keywords, which are indexed and stored in a single field called "signature_1". During indexing, I replace all punctuation with white space (using a ScriptUpdateProcessor), so my keywords are separated by white space in both the indexed and the stored parts of the field signature_1 (fieldType signature).

<fieldType name="signature" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <analyzer type="index">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^a-zA-Z0-9éèàùêâûôîäëöüï])" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LimitTokenCountFilterFactory" maxTokenCount="1000" consumeAllTokens="false"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!--<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang\stopwords_fr.txt" enablePositionIncrements="true" />-->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_chantiers.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_chantiers_secteurs.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French" />
  </analyzer>
  <analyzer type="query">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="([^a-zA-Z0-9éèàùêâûôîäëöüï])" replacement=" "/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.ASCIIFoldingFilterFactory"/>
    <!--<filter class="solr.StopFilterFactory" ignoreCase="true" words="lang\stopwords_fr.txt" enablePositionIncrements="true" />-->
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_chantiers.txt" ignoreCase="true" expand="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="French" />
  </analyzer>
</fieldType>

I would like the same behaviour at query time: if somebody searches for

A-B-C

I would like Solr to run the following search (with an OR operator, dismax):

A B C

So basically, I simply want Solr to search among the documents' keywords, with punctuation removed.

The example above works well, but in some cases it does not work this way. With a query of

A B-C

Dismax splits the query into

(+(DisjunctionMaxQuery((signature_1:a)) DisjunctionMaxQuery((signature_1:"b c"))) ())/no_coord

and this messes up the relevancy (i.e. the order) of my results. I tried using autoGeneratePhraseQueries="True", but it had no effect.

So I would like Dismax either to always split on both whitespace and punctuation, or never to split at all (the results would be the same). Any idea how I can achieve this (without having to write my own Java Dismax class)?

The following posts are related to my problem :

score 0 · Answer 1 · answered Sep 22 '14 at 22:59

I'm not really clear on whether you want A B-C to be a phrase query ("A B C") or three separate term queries (A B C), but:

If you want it to be a phrase query, just wrap the whole thing in quotes: "A B-C"

If you want each term to be searched separately, just remove the punctuation yourself, leaving A B C.

The query parser generally separates query clauses at spaces, not punctuation. This has nothing to do with analysis; it is just query parser syntax. So, for A B-C, you end up with two query clauses, A and B-C. When analysis kicks in, B-C is split into two terms, so the query parser makes it a phrase query instead of a term query, and the end result looks something like A "B C".
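To make the two stages concrete, here is a rough JavaScript simulation of that behaviour. It is not Solr code: `analyze` and `describeParse` are hypothetical names, and the analysis step is reduced to the charFilter pattern from the question plus lowercasing (no folding, synonyms, or stemming).

```javascript
// Simplified stand-in for the field's analysis chain (an assumption, not
// Solr's implementation): replace every character outside the allowed set
// with a space, as the charFilter in the question does, then split the
// result into lowercase tokens.
function analyze(clause) {
  return clause
    .replace(/[^a-zA-Z0-9éèàùêâûôîäëöüï]/g, " ")
    .split(/\s+/)
    .filter((token) => token.length > 0)
    .map((token) => token.toLowerCase());
}

// Stage 1: split the raw input into clauses on whitespace only.
// Stage 2: analyze each clause; a clause that yields several tokens is
// rendered as a phrase, a clause that yields one token as a plain term.
function describeParse(rawQuery) {
  return rawQuery
    .split(/\s+/)
    .filter((clause) => clause.length > 0)
    .map((clause) => {
      const tokens = analyze(clause);
      return tokens.length > 1 ? `"${tokens.join(" ")}"` : tokens.join("");
    })
    .filter((rendered) => rendered.length > 0);
}

console.log(describeParse("A B-C").join(" ")); // a "b c"
console.log(describeParse("A B C").join(" ")); // a b c
```

The first call mirrors the parsed query shown in the question: one term clause for a and one phrase clause for "b c".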

femtoRgon

Thanks for your answer. I do not want A B-C to be a phrase query; I want 3 separate term queries. I have edited my comment to reflect the fact that I am not the only user, so your solution does not work for me: I do not want to ask users to remove punctuation from their requests (also because some queries will be built by copying and pasting text that contains punctuation). – Vincent Ardiet Sep 23 '14 at 06:20
I was thinking more along the lines of having some logic to normalize search text, rather than training the user. – femtoRgon Sep 23 '14 at 13:15
What kind of logic? I am using the Velocity template, so a query goes directly from the text area of the form element in the web page to the DisjunctionMaxQuery of Solr (correct me if I'm wrong). Where would I implement such logic? – Vincent Ardiet Oct 31 '14 at 11:05

score 0 · Accepted Answer · edited May 23 '17 at 11:55

I finally found a solution. It's a bit "quick and dirty", but it works: in Velocity, I created a JavaScript function that edits the q field, and this function is called through the onsubmit attribute of a GET form (as described in stackoverflow.com/questions/5763055/edit-value-of-a-html-input-form-by-javascript).

But this solution requires Velocity (or, more generally, an HTML interface); if you are using a Request Handler without one, it does not work.
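The original function was not posted, so the following is only a minimal sketch of what such a submit hook could look like. The function name `normalizeQuery` and the form id `query-form` are hypothetical; the character class is copied from the charFilter in the question, and the field name `q` is the one mentioned above.

```javascript
// Hypothetical helper: replace every character outside the allowed set with
// a space (same pattern as the charFilter in the question), then collapse
// runs of whitespace so the query parser sees one clause per keyword.
function normalizeQuery(raw) {
  return raw
    .replace(/[^a-zA-Z0-9éèàùêâûôîäëöüï]/g, " ")
    .replace(/\s+/g, " ")
    .trim();
}

// Browser-only wiring, skipped when no DOM is available.
// Assumption: the search form has id "query-form" and a text input named "q".
if (typeof document !== "undefined") {
  const form = document.getElementById("query-form");
  if (form) {
    // Rewrite the q field just before the GET request is sent.
    form.addEventListener("submit", () => {
      const field = form.elements["q"];
      field.value = normalizeQuery(field.value);
    });
  }
}

console.log(normalizeQuery("A B-C")); // A B C
```

Note that this sketch also strips characters that carry meaning in query syntax, such as quotes, + and -, so it only suits the case described in the question, where every query should become a plain list of keywords.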
