3

I'd like to ensure that, say, I.B.M. can be found by searching for ibm. I'd also like to make sure that Dismemberment Plan can be found by searching for dismember.

Using Solr, what tokenizer and filters can I use at index and query time to permit both kinds of results?

Carson
  • I'd start by using the "DisMax" query parser. For novice users, it's much more friendly. Not sure if that'll help at all with the specific cases you've raised, however. – Frank Farmer Oct 11 '11 at 19:33

1 Answer

9

For I.B.M. => ibm
you would need solr.WordDelimiterFilterFactory, which strips special characters and can catenate words and numbers.

catenateWords="1" catenates the word parts and transforms I.B.M into IBM.

For Dismemberment => dismember
you need to include a stemmer filter (e.g. solr.PorterStemFilterFactory, solr.EnglishMinimalStemFilterFactory), which indexes the roots of words and provides matches for words that share the same root.

In addition, you can use solr.LowerCaseFilterFactory for case-insensitive matches (IBM and ibm) and solr.ASCIIFoldingFilterFactory for handling accented (non-ASCII) characters.

You can always use SynonymFilterFactory to map words which you think are synonyms.
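As a rough sketch of the synonym approach: the filter below would go inside an analyzer, ahead of the stemmer. The file name synonyms.txt and the entry shown in the comment are assumptions for illustration, not something taken from a real configuration.

```xml
<!-- Sketch only: place inside an <analyzer> element, before the stemmer.
     synonyms.txt is an assumed file name in the core's conf/ directory. -->
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
<!-- Illustrative synonyms.txt entry (one comma-separated group per line):
     dismember, dismemberment
-->
```

With expand="true", every term in a comma-separated group is treated as equivalent to the others, so the mapping is explicit and does not depend on how a particular stemmer reduces each word.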

You can apply these filters at both index and query time, so that text is converted the same way in both and the results are consistent.

An example field type definition:

<fieldType name="text_en_splitting" class="solr.TextField" positionIncrementGap="100" autoGeneratePhraseQueries="true">
  <!-- Index and Query time -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Stemmer -->
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters

Jayendra
  • thanks for this. very helpful. fwiw, the stemmer doesn't appear to be picking up `dismember`. – Carson Oct 11 '11 at 23:19
  • Dismemberment is converted to the root dismember by PorterStemFilterFactory, tested with Solr 3.3. You can check this on the analysis page for the example field type configuration mentioned above. PorterStemFilterFactory is an aggressive stemmer, and you may get odd results with it as well. You can use the synonym filter if you want to map words which you think are similar. – Jayendra Oct 12 '11 at 05:51