I have a problem with Solr stopwords in my autosuggest: every stopword is replaced by an `_` symbol.

For example, the field "deal_title" contains the text "the simple text in". When I search for the word "simple", Solr returns "_ simple text _", but I expect "simple text".

Could someone explain why it works this way and how to fix it? Here is part of my schema.xml:

<fieldType class="solr.TextField" name="text_auto">
    <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> 
        <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true" outputUnigramsIfNoShingles="false" /> 
    </analyzer> 
    <analyzer type="query">
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> 
        <tokenizer class="solr.StandardTokenizerFactory"/> 
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    </analyzer>
</fieldType>

<field name="deal_title" type="text_auto" indexed="true" stored="true" required="false" multiValued="false"/>

<fieldType name="text_general" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
Alex Sylka

2 Answers

My solution to this in Solr 6.3 (where enablePositionIncrements="false" isn't possible anymore) was to:

  1. remove stopwords
  2. shingle with fillerToken="" (which removes the _)
  3. remove leading and trailing spaces
  4. remove duplicates

    <filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_de.txt" ignoreCase="true"/>
    <filter class="solr.ShingleFilterFactory" fillerToken=""/>
    <filter class="solr.PatternReplaceFilterFactory" pattern="(^ | $)" replacement=""/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
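
For context, here is a rough sketch of how the four filters above could sit inside a complete field type. The name `text_auto_shingle` is hypothetical, and the tokenizer, char filter, `stopwords.txt`, and `maxShingleSize="3"` are assumed from the question's `text_auto` definition rather than from this answer. The second `PatternReplaceFilterFactory` is an optional extra that collapses the runs of spaces left behind when several stopwords in a row are removed; treat the whole thing as a starting point and check the result on the Analysis screen of the Solr admin UI.

```xml
<!-- Hypothetical field type; the name and the surrounding analyzer setup are assumptions. -->
<fieldType name="text_auto_shingle" class="solr.TextField">
  <analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- 1. remove stopwords (this leaves position gaps behind) -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- 2. build shingles, filling the gaps with an empty string instead of "_" -->
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true" fillerToken=""/>
    <!-- 3. trim leading and trailing spaces left by the empty filler -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="(^ +| +$)" replacement="" replace="all"/>
    <!-- optional: collapse repeated inner spaces into a single space -->
    <filter class="solr.PatternReplaceFilterFactory" pattern=" {2,}" replacement=" " replace="all"/>
    <!-- 4. drop identical tokens that share the same position -->
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

The query analyzer deliberately has no shingle filter, so a typed word is matched against the unigrams and shingles produced at index time. Reindex after changing the index analyzer.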
    

Marco
  • This is not correct, because RemoveDuplicatesTokenFilterFactory does not remove tokens that are at different positions. https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-RemoveDuplicatesTokenFilter – Mohammad Chamanpara Aug 31 '18 at 02:21
  • This worked for me in Solr 7.6, though I did add an additional `PatternReplaceFilterFactory` to remove double spaces – eben.english Jan 24 '19 at 18:23

To fix this you need to use `<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="false" />` in schema.xml and `<luceneMatchVersion>4.3</luceneMatchVersion>` in solrconfig.xml.
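
Laid out as configuration, the two settings might look like the sketch below. Placing the stop filter directly before the shingle filter is an assumption based on the question's `text_auto` index analyzer, and going by the comments under this answer, `enablePositionIncrements="false"` is only accepted while `luceneMatchVersion` stays below 4.4.

```xml
<!-- solrconfig.xml: pin analysis behaviour to a Lucene version older than 4.4 -->
<luceneMatchVersion>4.3</luceneMatchVersion>

<!-- schema.xml, inside the index analyzer (placement assumed from the question):
     with position increments disabled the stop filter leaves no gaps,
     so the shingle filter has nothing to fill with "_" -->
<filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="false"/>
<filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true" outputUnigramsIfNoShingles="false"/>
```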

Okke Klein
  • I am using the latest Solr version, which is why I have 4.10.3 in my solrconfig.xml. It looks like I should downgrade luceneMatchVersion, because this doesn't work with the current version (4.10.3). – Alex Sylka Feb 13 '15 at 10:12
  • It doesn't work from Solr 4.4 and up. In Solr 5 it will be removed. I'm trying to prevent that. – Okke Klein Feb 13 '15 at 11:01
  • I also have one more field of type "text_general" (described above) for wildcard searching with the regex ""/.*"+phrase+".*/"". It works well, but stopwords don't work for this field (I think because of solr.KeywordTokenizerFactory). Could you suggest some other filter? – Alex Sylka Feb 13 '15 at 13:03
  • You mean stopwords were totally removed starting from version 4.4? And there is no way to implement them in Solr 4.10.3 without downgrading luceneMatchVersion? – Alex Sylka Feb 13 '15 at 13:48