SOLR eDismax typo tolerance for phrases

Question

How is it possible to build a query that searches for exact phrases as well as phrases with some typos? I'm stuck on this, and it looks like I'm moving in the wrong direction.

For example, I have the following query string in my edismax query:

q=apple iphone

It works, but now I need to make it more tolerant of typos. I updated my query, and now it returns the same results as before even when the user makes typing mistakes:

q=aple~2 iphane~2

Next I found that the exact match is not always on the first page (for example, I really do have a product 'aple iphane'). So I added the exact query with an 'OR' condition. Now my query looks like this:

q=(aple~2 iphane~2) OR 'aple iphane'^3

The problem is that it now returns only the exact match and no longer returns the fuzzy matches. What am I doing wrong?

Here is the full query:

http://localhost:8983/solr/test/select?omitHeader=true
&q=(aple~2 iphane~2) OR 'aple iphane'^3
&start=0
&rows=30
&fl=*,score
&fq=itemType:"Product"
&defType=edismax
&qf=title_de^1000 title_de_ranked^1000 description_de^1 category_name_de^50 brand^15 merchant_name^80 uniuque_values^10000 searchable_attribute_product.name^1000 searchable_attribute_product.description.short^100 searchable_attribute_product.description.long^100 searchable_attribute_mb.book.author^500
&mm=90
&pf=title_de^2000 description_de^2
&ps=1
&qs=2
&boost=category_boost
&mm.autoRelax=true
&wt=json
&json.nl=flat

Do I have an error in the query, or is the approach I chose totally wrong?
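
One detail worth checking in the request above: in the Lucene/Solr query syntax a phrase is delimited by double quotes, so 'aple iphane'^3 in single quotes is not parsed as a boosted phrase. Below is a minimal Python sketch of how the q parameter could be assembled with fuzzy terms plus a double-quoted, boosted phrase. The helper name build_fuzzy_query and its defaults are hypothetical, the parameter set is trimmed to a few of the ones from the question, and user input is not escaped here.

```python
from urllib.parse import urlencode


def build_fuzzy_query(text, max_edits=2, phrase_boost=3):
    """Return a q string: fuzzy terms OR the exact phrase with a boost.

    Hypothetical helper for illustration; it does not escape Lucene
    special characters in the user's input.
    """
    terms = text.split()
    fuzzy = " ".join(f"{t}~{max_edits}" for t in terms)
    # A phrase needs double quotes in Lucene query syntax.
    phrase = '"' + " ".join(terms) + '"'
    return f"({fuzzy}) OR {phrase}^{phrase_boost}"


params = {
    "q": build_fuzzy_query("aple iphane"),
    "defType": "edismax",
    "qf": "title_de",
    "pf": "title_de",
    "fl": "*,score",
    "wt": "json",
}
# urlencode takes care of percent-encoding spaces, quotes and parentheses.
url = "http://localhost:8983/solr/test/select?" + urlencode(params)
print(params["q"])
```

Whether this alone gives the desired ranking still depends on mm and on how the parser counts the top-level clauses, so the debugQuery output is the place to verify it.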

I want to find this phrase in 'title_de'; all other fields are secondary. Here is its field type from my schema:

<fieldType name="text_de_ngram" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.GermanNormalizationFilterFactory"/>
        <filter class="solr.GermanLightStemFilterFactory"/>
        <filter class="solr.NGramFilterFactory" minGramSize="2" maxGramSize="25"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de.txt" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
        <filter class="solr.GermanNormalizationFilterFactory"/>
        <filter class="solr.GermanLightStemFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="German" />
    </analyzer>
</fieldType>
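
To make the effect of the index-time NGramFilterFactory (minGramSize="2", maxGramSize="25") more concrete, here is a rough Python sketch of the character n-grams that end up in the index for a single token. It is only an approximation under stated assumptions: it ignores the stemming and normalization filters that run before the n-gram filter, and the order in which Lucene emits the grams may differ. The function name char_ngrams is made up for this example.

```python
def char_ngrams(token, min_gram=2, max_gram=25):
    """Return all substrings of token whose length is min_gram..max_gram."""
    grams = []
    for size in range(min_gram, min(max_gram, len(token)) + 1):
        for start in range(len(token) - size + 1):
            grams.append(token[start:start + size])
    return grams


grams = char_ngrams("samsung")
print(len(grams))
# Longer grams such as "amsung" or "samsun" are indexed as terms of their own.
print([g for g in grams if len(g) >= 5])
```

Because the query analyzer has no n-gram filter, a fuzzy query term is compared against every one of these indexed fragments, not just against whole words.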

Thanks!

UPD: I found that my query (q=(aple~2 iphane~2) OR 'aple iphane'^3) was incorrect, so I worked out two other queries that work better; you can see them at the end of the post. I still do not know why they give different results: the default operator for a Solr query is 'OR', so 'term1 OR term2 OR term3 OR term4' should be the same as '(term1 OR term2) OR (term3 OR term4)'.
As suggested by @Persimmonium, I have added some debug output to show that fuzzy queries do work with edismax (but not always as expected). I found that 'apple iphone' is not the best example for my large German-language index, so I used the product named 'Samsung Magic Info-Lite' as the example instead.

Here are all params for my query:

"params":{
      "mm":"100%",
      "q":"samsung magic",
      "defType":"edismax",
      "indent":"on",
      "qf":"title_de",
      "fl":"*,score",
      "pf":"title_de",
      "wt":"json",
      "debugQuery":"on",
      "_":"1501409530601"
}

So, this query returns the right products (I have 6 products with both of these words in the title_de field). Then I add typos to both words:

"q":"somsung majic"

No products are found.

Then I add fuzzy operators to both words:

"q":"somsung~2 majic~2"

6 products are found. Here is the debugQuery result:

"debug":{
      "rawquerystring":"somsung~2 majic~2",
      "querystring":"somsung~2 majic~2",
      "parsedquery":"(+(DisjunctionMaxQuery((title_de:somsung~2)) DisjunctionMaxQuery((title_de:majic~2)))~2 DisjunctionMaxQuery((title_de:\"somsung 2 majic 2\")))/no_coord",
      "parsedquery_toString":"+(((title_de:somsung~2) (title_de:majic~2))~2) (title_de:\"somsung 2 majic 2\")",
      "explain":{
            "69019":"\n1.3424492 = sum of:\n  1.3424492 = sum of:\n    1.1036766 = sum of:\n      0.26367697 = weight(title_de:amsung in 305456) [ClassicSimilarity], result of:\n        0.26367697 = score(doc=305456,freq=1.0), product of:\n          0.073149204 = queryWeight, product of:\n            0.6666666 = boost\n            7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:\n              635.0 = docFreq\n              316313.0 = docCount\n            0.015219777 = queryNorm\n          3.604646 = fieldWeight in 305456, product of:\n            1.0 = tf(freq=1.0), with freq of:\n              1.0 = termFreq=1.0\n            7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:\n              635.0 = docFreq\n              316313.0 = docCount\n            0.5 = fieldNorm(doc=305456)\n      0.2373093 = weight(title_de:msung in 305456) [ClassicSimilarity], result of:\n        0.2373093 = score(doc=305456,freq=1.0), product of:\n          0.06583429 = queryWeight, product of:\n            0.6 = boost\n            7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:\n              635.0 = docFreq\n              316313.0 = docCount\n            0.015219777 = queryNorm\n          3.604646 = fieldWeight in 305456, product of:\n            1.0 = tf(freq=1.0), with freq of:\n              1.0 = termFreq=1.0\n            7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:\n              635.0 = docFreq\n              316313.0 = docCount\n            0.5 = fieldNorm(doc=305456)\n      0.26367697 = weight(title_de:samsun in 305456) [ClassicSimilarity], result of:\n        0.26367697 = score(doc=305456,freq=1.0), product of:\n          0.073149204 = queryWeight, product of:\n            0.6666666 = boost\n            7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:\n              635.0 = docFreq\n              316313.0 = docCount\n            0.015219777 = queryNorm\n          3.604646 = 
fieldWeight in 305456, product of:\n            1.0 = tf(freq=1.0), with freq of:\n              1.0 = termFreq=1.0\n            7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:\n              635.0 = docFreq\n              316313.0 = docCount\n            0.5 = fieldNorm(doc=305456)\n      0.33901328 = weight(title_de:samsung in 305456) [ClassicSimilarity], result of:\n        0.33901328 = score(doc=305456,freq=1.0), product of:\n          0.094048984 = queryWeight, product of:\n            0.85714287 = boost\n            7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:\n              635.0 = docFreq\n              316313.0 = docCount\n            0.015219777 = queryNorm\n          3.604646 = fieldWeight in 305456, product of:\n            1.0 = tf(freq=1.0), with freq of:\n              1.0 = termFreq=1.0\n            7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:\n              635.0 = docFreq\n              316313.0 = docCount\n            0.5 = fieldNorm(doc=305456)\n    0.23877257 = sum of:\n      0.23877257 = weight(title_de:magic in 305456) [ClassicSimilarity], result of:\n        0.23877257 = score(doc=305456,freq=1.0), product of:\n          0.0762529 = queryWeight, product of:\n            0.8 = boost\n            6.262649 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:\n              1638.0 = docFreq\n              316313.0 = docCount\n            0.015219777 = queryNorm\n          3.1313245 = fieldWeight in 305456, product of:\n            1.0 = tf(freq=1.0), with freq of:\n              1.0 = termFreq=1.0\n            6.262649 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:\n              1638.0 = docFreq\n              316313.0 = docCount\n            0.5 = fieldNorm(doc=305456)\n",
      },
      "QParser":"ExtendedDismaxQParser"
}
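
The explain output above shows that somsung~2 matched the indexed terms amsung, msung, samsun and samsung, and that majic~2 matched magic. The Python sketch below checks the edit distances between those terms. It uses plain Levenshtein distance, while Lucene's fuzzy matching by default also counts a transposition as a single edit; for these particular terms that makes no difference. The boost formula in observed_boost is an assumption inferred from the numbers in the explain output (0.857, 0.667, 0.6, 0.8), not something taken from the Solr documentation.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,           # delete ca
                            curr[j - 1] + 1,       # insert cb
                            prev[j - 1] + cost))   # substitute or keep
        prev = curr
    return prev[-1]


def observed_boost(query_term, indexed_term):
    # Assumption: inferred from the "boost" values in the explain output.
    distance = levenshtein(query_term, indexed_term)
    return 1 - distance / min(len(query_term), len(indexed_term))


for indexed in ["samsung", "samsun", "amsung", "msung"]:
    print(indexed, levenshtein("somsung", indexed),
          round(observed_boost("somsung", indexed), 4))
print("magic", levenshtein("majic", "magic"),
      round(observed_boost("majic", "magic"), 4))
```

All of the matched n-gram fragments are within two edits of the misspelled term, which is why each of them contributes to the score of the same document.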

This behavior is fine as long as I have no real product named 'Somsung majic'. That is a theoretical situation, but in practice there are a lot of other incorrect search results caused by these fuzzy operators.

So, to handle such cases, my idea, as I described initially, was to add the exact terms (without fuzzy modifiers) with a boost factor. The question now is how best to implement it. I found that this query works acceptably if I decrease the mm parameter:

"q":"somsung~2 majic~2 somsung^3 majic^3"

That is because I added more words to the query, so 'minimum should match' also needs to be decreased. The problem is that with a lower 'mm' I get bad results for long titles that contain the exact phrase (some wrong items can rank higher because of other factors). Here is the debug output for it:

"debug":{
      "rawquerystring":"somsung~2 majic~2 somsung^3 majic^3",
      "querystring":"somsung~2 majic~2 somsung^3 majic^3",
      "parsedquery":"(+(DisjunctionMaxQuery((title_de:somsung~2)) DisjunctionMaxQuery((title_de:majic~2)) DisjunctionMaxQuery((title_de:somsung))^3.0 DisjunctionMaxQuery((title_de:majic))^3.0)~2 DisjunctionMaxQuery((title_de:\"somsung 2 majic 2 somsung 3 majic 3\")))/no_coord",
      "parsedquery_toString":"+(((title_de:somsung~2) (title_de:majic~2) ((title_de:somsung))^3.0 ((title_de:majic))^3.0)~2) (title_de:\"somsung 2 majic 2 somsung 3 majic 3\")",
      "explain":{
            "69019":"\n0.3418829 = sum of:\n  0.3418829 = product of:\n    0.6837658 = sum of:\n      0.5621489 = sum of:\n        0.13430178 = weight(title_de:amsung in 305456) [ClassicSimilarity], result of:\n          0.13430178 = score(doc=305456,freq=1.0), product of:\n            0.037257966 = queryWeight, product of:\n              0.6666666 = boost\n              7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:\n                635.0 = docFreq\n                316313.0 = docCount\n              0.0077520725 = queryNorm\n            3.604646 = fieldWeight in 305456, product of:\n              1.0 = tf(freq=1.0), with freq of:\n                1.0 = termFreq=1.0\n              7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:\n                635.0 = docFreq\n                316313.0 = docCount\n              0.5 = fieldNorm(doc=305456)\n        0.12087161 = weight(title_de:msung in 305456) [ClassicSimilarity], result of:\n          0.12087161 = score(doc=305456,freq=1.0), product of:\n            0.033532172 = queryWeight, product of:\n              0.6 = boost\n              7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:\n                635.0 = docFreq\n                316313.0 = docCount\n              0.0077520725 = queryNorm\n            3.604646 = fieldWeight in 305456, product of:\n              1.0 = tf(freq=1.0), with freq of:\n                1.0 = termFreq=1.0\n              7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:\n                635.0 = docFreq\n                316313.0 = docCount\n              0.5 = fieldNorm(doc=305456)\n        0.13430178 = weight(title_de:samsun in 305456) [ClassicSimilarity], result of:\n          0.13430178 = score(doc=305456,freq=1.0), product of:\n            0.037257966 = queryWeight, product of:\n              0.6666666 = boost\n              7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:\n                635.0 = 
docFreq\n                316313.0 = docCount\n              0.0077520725 = queryNorm\n            3.604646 = fieldWeight in 305456, product of:\n              1.0 = tf(freq=1.0), with freq of:\n                1.0 = termFreq=1.0\n              7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:\n                635.0 = docFreq\n                316313.0 = docCount\n              0.5 = fieldNorm(doc=305456)\n        0.17267373 = weight(title_de:samsung in 305456) [ClassicSimilarity], result of:\n          0.17267373 = score(doc=305456,freq=1.0), product of:\n            0.047903106 = queryWeight, product of:\n              0.85714287 = boost\n              7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:\n                635.0 = docFreq\n                316313.0 = docCount\n              0.0077520725 = queryNorm\n            3.604646 = fieldWeight in 305456, product of:\n              1.0 = tf(freq=1.0), with freq of:\n                1.0 = termFreq=1.0\n              7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:\n                635.0 = docFreq\n                316313.0 = docCount\n              0.5 = fieldNorm(doc=305456)\n      0.12161691 = sum of:\n        0.12161691 = weight(title_de:magic in 305456) [ClassicSimilarity], result of:\n          0.12161691 = score(doc=305456,freq=1.0), product of:\n            0.038838807 = queryWeight, product of:\n              0.8 = boost\n              6.262649 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:\n                1638.0 = docFreq\n                316313.0 = docCount\n              0.0077520725 = queryNorm\n            3.1313245 = fieldWeight in 305456, product of:\n              1.0 = tf(freq=1.0), with freq of:\n                1.0 = termFreq=1.0\n              6.262649 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:\n                1638.0 = docFreq\n                316313.0 = docCount\n              0.5 = fieldNorm(doc=305456)\n    0.5 
= coord(2/4)\n"
      },
      "QParser":"ExtendedDismaxQParser"
}
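
The ~2 after the parenthesized group in parsedquery is the number of optional clauses that must match. As a small sketch of the arithmetic, assuming a positive percentage that is rounded down (as the Solr reference guide describes for mm percentages); the function name min_should_match is invented for this example, and the mm value actually used for the debug output above is not stated, only that it was decreased:

```python
def min_should_match(optional_clauses, mm_percent):
    """Clauses that must match for a positive mm percentage, rounded down."""
    return (optional_clauses * mm_percent) // 100


# (number of top-level clauses, mm in percent)
for clauses, mm in [(2, 100), (4, 100), (4, 90), (4, 50)]:
    print(clauses, mm, min_should_match(clauses, mm))
```

With four top-level clauses, mm=100% would require all four to match, but the exact terms somsung and majic match nothing in the index, so at most the two fuzzy clauses can match. A value of 2, as shown in the parsed query, is what a setting like 50% would produce.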

This query works even with a large 'mm' parameter (e.g., 90%):

"q":"(somsung~2 majic~2) OR (somsung^3 majic^3)"

But the problem here is that I get 430 results (instead of the desired 6). Here is the debug output with an example of a wrong product:

"debug":{
      "rawquerystring":"(somsung~2 majic~2) OR (somsung^3 majic^3)",
      "querystring":"(somsung~2 majic~2) OR (somsung^3 majic^3)",
      "parsedquery":"(+((DisjunctionMaxQuery((title_de:somsung~2)) DisjunctionMaxQuery((title_de:majic~2))) (DisjunctionMaxQuery((title_de:somsung))^3.0 DisjunctionMaxQuery((title_de:majic))^3.0))~1 DisjunctionMaxQuery((title_de:\"somsung 2 majic 2 somsung 3 majic 3\")))/no_coord",
      "parsedquery_toString":"+((((title_de:somsung~2) (title_de:majic~2)) (((title_de:somsung))^3.0 ((title_de:majic))^3.0))~1) (title_de:\"somsung 2 majic 2 somsung 3 majic 3\")",
      "explain":{
            "113746":"\n0.1275867 = sum of:\n  0.1275867 = product of:\n    0.2551734 = sum of:\n      0.2551734 = product of:\n        0.5103468 = sum of:\n          0.5103468 = sum of:\n            0.26860356 = weight(title_de:losung in 296822) [ClassicSimilarity], result of:\n              0.26860356 = score(doc=296822,freq=1.0), product of:\n                0.037257966 = queryWeight, product of:\n                  0.6666666 = boost\n                  7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:\n                    635.0 = docFreq\n                    316313.0 = docCount\n                  0.0077520725 = queryNorm\n                7.209292 = fieldWeight in 296822, product of:\n                  1.0 = tf(freq=1.0), with freq of:\n                    1.0 = termFreq=1.0\n                  7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:\n                    635.0 = docFreq\n                    316313.0 = docCount\n                  1.0 = fieldNorm(doc=296822)\n            0.24174322 = weight(title_de:osung in 296822) [ClassicSimilarity], result of:\n              0.24174322 = score(doc=296822,freq=1.0), product of:\n                0.033532172 = queryWeight, product of:\n                  0.6 = boost\n                  7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:\n                    635.0 = docFreq\n                    316313.0 = docCount\n                  0.0077520725 = queryNorm\n                7.209292 = fieldWeight in 296822, product of:\n                  1.0 = tf(freq=1.0), with freq of:\n                    1.0 = termFreq=1.0\n                  7.209292 = idf, computed as log((docCount+1)/(docFreq+1)) + 1 from:\n                    635.0 = docFreq\n                    316313.0 = docCount\n                  1.0 = fieldNorm(doc=296822)\n        0.5 = coord(1/2)\n    0.5 = coord(1/2)\n"
      },
      "QParser":"ExtendedDismaxQParser"
}
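
In the parsed query above, the minimum-should-match marker (~1) is attached only to the outer group, which now has two clauses instead of four, while the inner groups carry no marker at all. The following Python sketch is a toy model of that structure, not Solr's implementation: it assumes mm is rounded down and applied to the top-level groups only, and that inside a group a single matching clause is enough. The clause labels and document sets are made up for illustration.

```python
def matches(doc_hits, groups, mm_percent):
    """Toy model of top-level minimum-should-match.

    doc_hits: set of clause labels the document matches.
    groups:   list of lists of clause labels, one inner list per top-level clause.
    """
    required = (len(groups) * mm_percent) // 100
    # Assumption: inside a group, any single matching clause is enough.
    matched_groups = sum(1 for g in groups if any(c in doc_hits for c in g))
    return matched_groups >= required


flat = [["somsung~2"], ["majic~2"], ["somsung^3"], ["majic^3"]]
grouped = [["somsung~2", "majic~2"], ["somsung^3", "majic^3"]]

one_fuzzy_hit = {"somsung~2"}               # e.g. a title containing "losung"
both_fuzzy_hits = {"somsung~2", "majic~2"}  # e.g. "Samsung Magic Info-Lite"

print(matches(one_fuzzy_hit, grouped, 90))
print(matches(one_fuzzy_hit, flat, 90))
print(matches(both_fuzzy_hits, flat, 50))
```

Under these assumptions, a document that matches only one fuzzy term passes the grouped query even at mm=90%, which would explain the jump from 6 to 430 results, and it is also why the flat and the parenthesized forms are not equivalent once mm is involved.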

So, although I got better results, I still need to improve the search, and I still do not know which approach to choose or why I get such results.

score 0 · Answer 1 · answered Jul 30 '17 at 08:23

I think edismax does NOT support the fuzzy operator ~. There is a long-standing patch here that its developer has been using in production for a long time, but it hasn't made it into the Solr codebase yet.

Persimmonium

OK, if fuzzy is not supported for edismax, why does my query 'q=aple~2 iphane~2' find the right results? Is it interpreted as a standard query? – Aronsky Jul 30 '17 at 10:04
Use debugQuery=true and let's find out. – Persimmonium Jul 30 '17 at 11:47
I updated the description with new experiments and debug results. It looks like edismax really is able to work with fuzzy queries out of the box. Now I'm even more confused than before :) – Aronsky Jul 30 '17 at 16:33

score 0 · Answer 2 · answered Sep 08 '18 at 05:34

edismax does work with fuzzy queries. However, when you include mm=90 you are setting a very high minimum-should-match threshold, so nearly every clause in the query has to match. That's high!

Removing it, or using a low percentage like 50%, would allow some fuzziness to work.

F.O.O
