Solr (Open Solr) suggester results contain punctuation marks

Question

I'm working on a suggester and the results I'm getting back contain punctuation. For example, when I type "Volcan" I get:

"volcanoes", "volcanic", "volcano", "volcano," (trailing comma), "volcanoes." (trailing period/full stop)

Here is the code in the solrconfig.xml file:

<searchComponent class="solr.SpellCheckComponent" name="suggest">
  <lst name="spellchecker">
    <str name="name">suggest</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">text</str>
    <float name="threshold">0.005</float>
    <str name="buildOnCommit">true</str>
  </lst>
</searchComponent>
<requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/suggest">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggest</str>
    <str name="spellcheck.onlyMorePopular">true</str>
    <str name="spellcheck.count">5</str>
    <str name="spellcheck.collate">true</str>
  </lst>
  <lst name="invariants">
      <!-- always run the Suggester for queries to this handler -->
      <str name="spellcheck">true</str>
      <!-- collate not needed: the query is tokenized as a keyword, and we only need suggestions for that term -->
      <str name="spellcheck.collate">false</str>
  </lst>
  <arr name="components">
    <str>suggest</str>
  </arr>
</requestHandler>

In the schema.xml file I have this:

<fieldType name="spell" class="solr.TextField" positionIncrementGap="100" indexed="true" stored="false" multiValued="true" termVectors="true" termPositions="true" termOffsets="true">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.ShingleFilterFactory"
                    minShingleSize="2"
                    maxShingleSize="4"
                    outputUnigrams="true"
                    outputUnigramsIfNoShingles="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.TrimFilterFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

And the result is:

{
    "responseHeader": {
        "status": 0,
        "QTime": 0,
        "params": {
            "wt": "json",
            "q": "volcan"
        }
    },
    "spellcheck": {
        "suggestions": [
            "volcan",
            {
                "numFound": 5,
                "startOffset": 0,
                "endOffset": 6,
                "suggestion": [
                    "volcanoes",
                    "volcanic",
                    "volcano",
                    "volcano,",
                    "volcanoes."
                ]
            }
        ]
    }
}

Did you check the field type of `text`? Make sure it is bound to `textSpell` (or an equivalent) and that `textSpell` uses a tokenizer that discards or splits on punctuation, e.g. `StandardTokenizerFactory`. — EricLavault, Nov 27 '14 at 18:52
I changed WhitespaceTokenizerFactory to StandardTokenizerFactory and it looks much better! Thanks @n0tting!! — Mark Robson, Nov 28 '14 at 11:08
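
The fix described in the comments above might look roughly like the fragment below. This is only a sketch: the question does not show how `text` is actually defined, so the type name `text_suggest` and the field attributes are assumptions made for illustration. The point is that the field named in the suggester's `field` setting must use an index analyzer that splits on punctuation.

```xml
<!-- schema.xml sketch; "text_suggest" is a hypothetical type name -->
<fieldType name="text_suggest" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <!-- StandardTokenizerFactory drops surrounding punctuation, so "volcano," is indexed as "volcano".
         WhitespaceTokenizerFactory splits on whitespace only and would keep the comma. -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<!-- The field referenced by <str name="field">text</str> in solrconfig.xml (attributes assumed) -->
<field name="text" type="text_suggest" indexed="true" stored="false" multiValued="true"/>
```

Changing an index-time analyzer only affects newly indexed terms, so existing documents need to be reindexed before the punctuated suggestions disappear.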

score 0 · Answer 1 · answered May 06 '15 at 09:37

The problem is not really in your requestHandler. Rather, it seems to lie in the way you're indexing the files that go into the spell field, and maybe in the spell field itself. I think you should use a tokenizer that strips the punctuation out of those fields.

Here's the spell field definition that works for me in schema.xml:

<fieldType name="spell" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
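
A field type like this only influences the suggestions if the suggester reads from a field that uses it. In the question, the suggester is configured with `<str name="field">text</str>`, so one possible wiring is sketched below. The field name `spell`, the `copyField` source `title`, and the field attributes are hypothetical and not taken from the original configuration.

```xml
<!-- schema.xml sketch: a field of type "spell", populated from another field -->
<field name="spell" type="spell" indexed="true" stored="false" multiValued="true"/>
<!-- "title" is a placeholder for whichever source fields should feed the suggester -->
<copyField source="title" dest="spell"/>

<!-- solrconfig.xml sketch: inside the "suggest" spellchecker, point at that field -->
<str name="field">spell</str>
```

Alternatively, the existing `text` field could be left as the suggester source and its own field type changed to use a punctuation-aware tokenizer.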

Hi, thanks for the contribution. Your schema entry is not too dissimilar to mine, though, and mine doesn't work. — Mark Robson, May 06 '15 at 15:32
