Solr stop words replaced with _ symbol

Question

I have a problem with Solr stopwords in my autosuggest. All stopwords are replaced by the _ symbol.

For example, I have the text "the simple text in" in the field "deal_title". When I search for the word "simple", Solr returns "_ simple text _", but I expect "simple text".

Could someone explain why it works this way and how to fix it? Here is part of my schema.xml:

<fieldType class="solr.TextField" name="text_auto">
    <analyzer type="index">
        <charFilter class="solr.HTMLStripCharFilterFactory"/>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> 
        <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true" outputUnigramsIfNoShingles="false" /> 
    </analyzer> 
    <analyzer type="query">
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> 
        <tokenizer class="solr.StandardTokenizerFactory"/> 
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" />
    </analyzer>
</fieldType>

<field name="deal_title" type="text_auto" indexed="true" stored="true" required="false" multiValued="false"/>

<fieldType name="text_general" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

score 2 · Answer 1 · answered Jan 10 '17 at 10:22


My solution to this in Solr 6.3 (where enablePositionIncrements="false" isn't possible anymore) was to:

remove stopwords
shingle with fillerToken="" (which removes the _)
remove leading and trailing spaces
remove duplicates

<filter class="solr.StopFilterFactory" format="snowball" words="lang/stopwords_de.txt" ignoreCase="true"/>
<filter class="solr.ShingleFilterFactory" fillerToken=""/>
<filter class="solr.PatternReplaceFilterFactory" pattern="(^ | $)" replacement=""/>
<filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
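
For context, a rough sketch of how these filters might slot into the index analyzer of the text_auto type from the question. The char filter, tokenizer, stopwords.txt, and maxShingleSize="3" are carried over from the question; the placement of LowerCaseFilterFactory is an assumption. The query analyzer is left out. Treat this as a starting point rather than a tested configuration.

```xml
<!-- Sketch only: the filter chain above combined with the question's index analyzer -->
<analyzer type="index">
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <!-- Assumption: lowercase first; with ignoreCase="true" the order should not affect stopword matching -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Removing stopwords leaves position gaps behind -->
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <!-- fillerToken defaults to "_"; an empty value fills the gaps with nothing -->
    <filter class="solr.ShingleFilterFactory" maxShingleSize="3" outputUnigrams="true" fillerToken=""/>
    <!-- Trim the separator space left at the start or end of a shingle -->
    <filter class="solr.PatternReplaceFilterFactory" pattern="(^ | $)" replacement=""/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
</analyzer>
```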


Marco


This is not correct, because RemoveDuplicatesTokenFilterFactory does not remove duplicate tokens at different positions. https://lucene.apache.org/solr/guide/6_6/filter-descriptions.html#FilterDescriptions-RemoveDuplicatesTokenFilter – Mohammad Chamanpara Aug 31 '18 at 02:21
This worked for me in Solr 7.6, though I did add an additional `PatternReplaceFilterFactory` to remove double spaces – eben.english Jan 24 '19 at 18:23
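
The extra filter mentioned in the comment above is not shown there. One possible form, as an assumption about what was meant: with an empty filler token, a stopword removed from the middle of a phrase can leave two separator spaces in a row inside a shingle, and a second pattern filter placed after ShingleFilterFactory could collapse them. The regex below is illustrative, not the commenter's actual configuration.

```xml
<!-- Hypothetical: collapse runs of whitespace inside a shingle into a single space -->
<filter class="solr.PatternReplaceFilterFactory" pattern="\s{2,}" replacement=" " replace="all"/>
```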

score 0 · Answer 2 · answered Feb 12 '15 at 13:44


To fix this you need to use <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true" enablePositionIncrements="false" /> and <luceneMatchVersion>4.3</luceneMatchVersion> in solrconfig.xml


Okke Klein


I am using the latest Solr version, which is why I have 4.10.3 in my solrconfig.xml. It looks like I should downgrade luceneMatchVersion, because this doesn't work with the current (4.10.3) version. – Alex Sylka Feb 13 '15 at 10:12
It doesn't work from Solr 4.4 and up. In Solr 5 it will be removed. I'm trying to prevent that. – Okke Klein Feb 13 '15 at 11:01
I also have one more field of type "text_general" (described above) for wildcard searching with the regex "/.*"+phrase+".*/". It works well, but stopwords don't work for this field (I think because of solr.KeywordTokenizerFactory). Could you suggest some other filter? – Alex Sylka Feb 13 '15 at 13:03
Do you mean stopwords were removed entirely starting from version 4.4? And is there no way to implement them in Solr 4.10.3 without downgrading luceneMatchVersion? – Alex Sylka Feb 13 '15 at 13:48
