
I'm using the sunspot_rails gem and everything is working perfectly so far, but I'm not getting any search results for words with a hyphen.

Example: The string "tron" returns a lot of results (the word mentioned in all articles is "e-tron").

The string "e-tron" returns 0 results even though this is the correct word mentioned in all my articles.

My current schema.xml config:

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

What I want: The behaviour for the search string tron is okay of course, but I also want to have the correct matches for the search string e-tron.

Evo_x

1 Answer

The problem is that solr.StandardTokenizerFactory splits words on hyphens, so "e-tron" produces the tokens "e" and "tron". Presumably "e" is then lost at index time, because the EdgeNGramFilterFactory is configured with minGramSize="2" and emits nothing for a one-character token.

Here is one configuration that should address your specific problem:

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

  1. solr.WhitespaceTokenizerFactory splits only on whitespace, so the hyphenated word stays a single token: ["e-tron"]
  2. solr.WordDelimiterFilterFactory splits on hyphens, and with preserveOriginal="1" it also keeps the original word: ["e", "tron", "e-tron"]

Andrew Grimm
polmiro
  • Well, this is an improvement, but now I get 156 hits for e-tron and 32 hits for tron. That can't be right :( – Evo_x Jul 31 '13 at 20:28
  • "e-tron" will look for both "e-tron" and "tron", so more results can potentially be found this way. Does this give you any clues? I can't say more without knowing what results you are getting. – polmiro Jul 31 '13 at 21:28
  • OK, that's a good clue. Maybe we have a misunderstanding here: the search string "e-tron" only needs to look for articles with the word "e-tron" (no need to search for "tron"; that would be a bonus). The only thing I need is: every search for "e-tron" finds everything with "e-tron", and every search for "tron" finds everything with "tron" and "e-tron". I hope you know what I mean. Thanks for all the help so far. – Evo_x Aug 01 '13 at 00:41
  • 2
    Then just remove the WordDelimiterFilterFactory. If you look at the documentation it explicitly says it splits by "-". http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory – polmiro Aug 01 '13 at 04:09
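
To get both behaviours asked for in the comments (a search for "e-tron" matches only "e-tron", while "tron" matches both "tron" and "e-tron"), one option is to keep the WordDelimiterFilterFactory in the index analyzer and drop it only from the query analyzer. This is an untested sketch based on the field type from the answer; it assumes the default WordDelimiterFilterFactory options apart from preserveOriginal, and it is not something the thread itself confirmed:

```xml
<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <!-- Split on whitespace only, so "e-tron" reaches the next filter intact -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- Emit the parts "e" and "tron" plus the original "e-tron" -->
    <filter class="solr.WordDelimiterFilterFactory" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Index front prefixes of each token for prefix matching -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="15" side="front"/>
  </analyzer>
  <analyzer type="query">
    <!-- Assumption: no WordDelimiterFilterFactory here, so the query "e-tron"
         stays one token and is not expanded to "e" and "tron" -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

With this setup, "e-tron" should still be indexed under prefixes of both "e-tron" and "tron", while a query for "e-tron" is left as a single token. Changes to the index analyzer only take effect after re-indexing.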