I hope you can help me, because this problem is driving me crazy.

To keep it simple: I have documents with a field named name_text_de_de, which contains values such as the following:

name_text_de_de
Industrie-Reiniger
Katalysator-Reiniger
Flächenreiniger
UNIVERSALREINIGER
FELGENREINIGER-GEL

This is not all of them, just a sample. The query q=name_text_de_de:*reinig returns the results above, but q=name_text_de_de:*reiniger returns NO results, which does not make sense to me at all.

What could be the problem here?

Thanks in advance,

Fide

        <fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.StandardTokenizerFactory" />
                <!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="lang/dictionary_de_de.txt" /> -->
                <filter class="solr.ManagedStopFilterFactory" managed="de" />
                <!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de_de.txt" /> -->
                <filter class="solr.LowerCaseFilterFactory" />
                <!-- <filter class="solr.KeywordRepeatFilterFactory" /> -->
                <filter class="solr.KeywordMarkerFilterFactory" protected="lang/protwords_de_de.txt" />
                <!-- <filter class="solr.SnowballPorterFilterFactory" language="German" /> -->
                <!-- <filter class="solr.SnowballPorterFilterFactory" language="German2" /> -->
                <!-- <filter class="solr.GermanStemFilterFactory" /> -->
                <!-- <filter class="solr.GermanLightStemFilterFactory" /> -->
                <filter class="solr.GermanMinimalStemFilterFactory" />
                <!-- <filter class="solr.GermanNormalizationFilterFactory" /> -->
                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.StandardTokenizerFactory" />
                <!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="lang/dictionary_de_de.txt" /> -->
                <filter class="solr.ManagedSynonymGraphFilterFactory" managed="de_de" />
                <filter class="solr.ManagedStopFilterFactory" managed="de_de" />
                <!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de_de.txt" /> -->
                <filter class="solr.LowerCaseFilterFactory" />
                <!-- <filter class="solr.KeywordRepeatFilterFactory" /> -->
                <filter class="solr.KeywordMarkerFilterFactory" protected="lang/protwords_de_de.txt" />
                <!-- <filter class="solr.SnowballPorterFilterFactory" language="German" /> -->
                <!-- <filter class="solr.SnowballPorterFilterFactory" language="German2" /> -->
                <!-- <filter class="solr.GermanStemFilterFactory" /> -->
                <!-- <filter class="solr.GermanLightStemFilterFactory" /> -->
                <filter class="solr.GermanMinimalStemFilterFactory" />
                <!-- <filter class="solr.GermanNormalizationFilterFactory" /> -->
                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
            </analyzer>
        </fieldType>

        <fieldType name="text_de_de" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.StandardTokenizerFactory" />
                <!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="lang/dictionary_de_de.txt" /> -->
                <filter class="solr.ManagedStopFilterFactory" managed="de_de" />
                <!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de_de.txt" /> -->
                <filter class="solr.LowerCaseFilterFactory" />
                <!-- <filter class="solr.KeywordRepeatFilterFactory" /> -->
                <filter class="solr.KeywordMarkerFilterFactory" protected="lang/protwords_de_de.txt" />
                <!-- <filter class="solr.SnowballPorterFilterFactory" language="German" /> -->
                <!-- <filter class="solr.SnowballPorterFilterFactory" language="German2" /> -->
                <!-- <filter class="solr.GermanStemFilterFactory" /> -->
                <!-- <filter class="solr.GermanLightStemFilterFactory" /> -->
                <filter class="solr.GermanMinimalStemFilterFactory" />
                <!-- <filter class="solr.GermanNormalizationFilterFactory" /> -->
                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.StandardTokenizerFactory" />
                <!-- <filter class="solr.DictionaryCompoundWordTokenFilterFactory" dictionary="lang/dictionary_de_de.txt" /> -->
                <filter class="solr.ManagedSynonymGraphFilterFactory" managed="de_de" />
                <filter class="solr.ManagedStopFilterFactory" managed="de_de" />
                <!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de_de.txt" /> -->
                <filter class="solr.LowerCaseFilterFactory" />
                <!-- <filter class="solr.KeywordRepeatFilterFactory" /> -->
                <filter class="solr.KeywordMarkerFilterFactory" protected="lang/protwords_de_de.txt" />
                <!-- <filter class="solr.SnowballPorterFilterFactory" language="German" /> -->
                <!-- <filter class="solr.SnowballPorterFilterFactory" language="German2" /> -->
                <!-- <filter class="solr.GermanStemFilterFactory" /> -->
                <!-- <filter class="solr.GermanLightStemFilterFactory" /> -->
                <filter class="solr.GermanMinimalStemFilterFactory" />
                <!-- <filter class="solr.GermanNormalizationFilterFactory" /> -->
                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
            </analyzer>
        </fieldType>

        <fieldType name="text_spell_de" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.StandardTokenizerFactory" />
                <filter class="solr.ManagedStopFilterFactory" managed="de_de" />
                <!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de_de.txt" /> -->
                <filter class="solr.LowerCaseFilterFactory" />
                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.StandardTokenizerFactory" />
                <filter class="solr.ManagedSynonymGraphFilterFactory" managed="de_de" />
                <filter class="solr.ManagedStopFilterFactory" managed="de_de" />
                <!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de_de.txt" /> -->
                <filter class="solr.LowerCaseFilterFactory" />
                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
            </analyzer>
        </fieldType>

        <fieldType name="text_spell_de_de" class="solr.TextField" positionIncrementGap="100">
            <analyzer type="index">
                <tokenizer class="solr.StandardTokenizerFactory" />
                <filter class="solr.ManagedStopFilterFactory" managed="de_de" />
                <!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de_de.txt" /> -->
                <filter class="solr.LowerCaseFilterFactory" />
                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
            </analyzer>
            <analyzer type="query">
                <tokenizer class="solr.StandardTokenizerFactory" />
                <filter class="solr.ManagedSynonymGraphFilterFactory" managed="de_de" />
                <filter class="solr.ManagedStopFilterFactory" managed="de_de" />
                <!-- <filter class="solr.StopFilterFactory" ignoreCase="true" words="lang/stopwords_de_de.txt" /> -->
                <filter class="solr.LowerCaseFilterFactory" />
                <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
            </analyzer>
        </fieldType>
  • Have you looked via the Analysis screen (https://solr.apache.org/guide/8_11/analysis-screen.html) how the field is being indexed with these particular values? – Hector Correa Nov 09 '22 at 17:05
  • The other thing I would try is Solr's debug query option (https://solr.apache.org/guide/8_9/common-query-parameters.html#debug-parameter) to see how your particular query is being parsed. The `explainOther` option might be useful here. – Hector Correa Nov 09 '22 at 17:07
  • When you use a wildcard query, analysis does not take place. You have a `GermanMinimalStemFilterFactory` in your indexing chain, which means that the token gets changed: the ending `er` is removed from `reiniger`, so the token actually stored is `reinig`. A wildcard query for `*reiniger` therefore looks for tokens ending with `reiniger`, of which there are none, since the token was changed to `reinig` at index time. If you want both behaviors, have multiple fields with different definitions. – MatsLindh Nov 09 '22 at 20:40
  • Hello guys, I added an answer without seeing @MatsLindh closing this post as a duplicate, though I think it deserves an answer which by the way provides a solution to handle both behaviors without having multiple fields. – EricLavault Nov 09 '22 at 21:04
  • @EricLavault That's a very neat technique! Could you add that to the original question as well? That way we have all the relevant techniques in a single location. There are probably questions asked before that one, but that was the one I found through the All Mighty Search engine. – MatsLindh Nov 09 '22 at 21:06
  • Maybe this field is mistaken, but it is the name of a product, not of a person. From my understanding, I cannot use wildcards on this field when stemming is used, which means I need a separate fieldType for all fields that allow wildcard searches, right? – Fide Nov 10 '22 at 08:10
  • @Fide, no, it's not mistaken: in the linked post the OP refers to names of persons, so he probably doesn't need stemming at all, while you do with those product names. And secondly, you _**can**_ use wildcards with stemming enabled on this field without having to define a separate field type. – EricLavault Nov 10 '22 at 14:51
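
The "multiple fields with different definitions" option mentioned in the comments above could look roughly like the following sketch. It reuses the `text_spell_de_de` type from the question, which has no stem filter, so its index keeps `reiniger` intact. The field name `name_wildcard_de_de` is hypothetical, and the `indexed`/`stored` settings are assumptions; if the source field is multi-valued, the destination would need `multiValued="true"` as well.

```xml
<!-- Hypothetical extra field; text_spell_de_de is the non-stemming type from the question -->
<field name="name_wildcard_de_de" type="text_spell_de_de" indexed="true" stored="false" />

<!-- Copy the raw input so it is analyzed a second time, this time without the stemmer -->
<copyField source="name_text_de_de" dest="name_wildcard_de_de" />
```

After reloading and reindexing, wildcard searches would target the new field (for example q=name_wildcard_de_de:*reiniger), while regular searches keep using the stemmed name_text_de_de field.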

1 Answer

The problem is that wildcard queries are not processed through the analysis chain, so your query term is not stemmed the way the original text was at index time.

For example, the token reiniger is truncated to reinig by the stem filter at index time, so the unfiltered query *reiniger can't match: there is no token ending with "reiniger" in the index.

 Input stream            |  Indexed tokens
-------------------------|--------------------------
 "Industrie-Reiniger"    |  "industri", "reinig"
 "Katalysator-Reiniger"  |  "katalysato", "reinig"
 "Flächenreiniger"       |  "flachenreinig"
 "UNIVERSALREINIGER"     |  "universalreinig"
 "FELGENREINIGER-GEL"    |  "felgenreinig", "gel"

To make wildcard queries and fuzzy search work properly with stemmers (and other filters that may truncate tokens), you need to add the KeywordRepeatFilterFactory before the stemmer in the analysis chain:

Emits each token twice, one with the KEYWORD attribute and once without.

If placed before a stemmer, the result will be that you will get the unstemmed token preserved on the same position as the stemmed one. Queries matching the original exact term will get a better score while still maintaining the recall benefit of stemming. Another advantage of keeping the original token is that wildcard truncation will work as expected.
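
As a minimal sketch, this is the index analyzer of the `text_de_de` type from the question with the previously commented-out `KeywordRepeatFilterFactory` enabled. The other commented-out alternatives are omitted for brevity, and everything else is assumed to stay as posted. Whether the query analyzer should get the same filter is a separate choice that is not shown here.

```xml
<analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory" />
    <filter class="solr.ManagedStopFilterFactory" managed="de_de" />
    <filter class="solr.LowerCaseFilterFactory" />
    <!-- Emit every token twice: one copy flagged as KEYWORD (left alone by the stemmer), one unflagged -->
    <filter class="solr.KeywordRepeatFilterFactory" />
    <filter class="solr.KeywordMarkerFilterFactory" protected="lang/protwords_de_de.txt" />
    <!-- Stems only the unflagged copy, e.g. reiniger -> reinig -->
    <filter class="solr.GermanMinimalStemFilterFactory" />
    <!-- Drops the second copy when stemming left the token unchanged -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
</analyzer>
```

With this chain, "Industrie-Reiniger" should be indexed with both reiniger and reinig at the same position, so *reiniger has a token to match. The schema change only takes effect after the core or collection is reloaded and the documents are reindexed.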

EricLavault
  • Thanks Eric, I will definitely try that; as you can see, it was commented out in my code. Actually, I was following a different idea: I had read that a prefix wildcard is too expensive and that I need to reverse it, but that did not work, so I will try this out for sure. – Fide Nov 09 '22 at 22:53
  • Actually, that did not work. I will try the solution from the "duplicated" question and remove stemming completely. – Fide Nov 09 '22 at 23:33
  • It should work. Did you reindex? You need to reload Solr to make the schema updates effective, and to reindex so that the preserved token can be indexed along with the stemmed tokens. – EricLavault Nov 10 '22 at 14:43
  • I did. I also unloaded the old collections before reindexing. No result. – Fide Nov 10 '22 at 17:39
  • It works. I ran the tests using your own text samples, queries, and base definition, so you must have missed something. Double-check the field types and the field names, and whether they match the queries (i.e. in your post I noticed that both `text_de` and `text_de_de` have the exact same definition, including the commented parts, so maybe you have confused these two, or the queried field is not of the proper type?). All I can do is add the definition I used, but it is literally the same thing as 'text_de[_de]' with a keyword repeat filter. – EricLavault Nov 10 '22 at 21:54
  • Actually, it did work, but I was misled by the analyzer, as you can see here: https://stackoverflow.com/a/74393454/4030802 I tried the query and got the right results. Thank you very much. – Fide Nov 11 '22 at 00:08