I'm running Solr 5.5.1 with the following filters in my field analyzer definition:

    <filter class="solr.MorfologikFilterFactory" />
    <filter class="solr.ASCIIFoldingFilterFactory"/>

It usually works great, but some words cause problems, for example Poznań. It's a city name, but the stemmer recognizes it as a Polish noun with the base form poznanie, and that's what gets indexed. ASCII folding should then make sure that a search for poznan matches documents containing poznań. But the stemmer does not recognize poznan as a form of poznanie, so there is no match.

Any ideas how to resolve this?

My idea for a workaround would be to make the stemmer always retain the original token, so that poznań turns into [poznań, poznanie] instead of just [poznanie]. Is there an easy way to achieve this? Is there a reason it doesn't work like this by default? I didn't find anything about it in the javadoc for solr.MorfologikFilterFactory.

Speedstone
  • What is the analyzer at query time? Can you post the relevant parts of your schema.xml? – root Nov 24 '16 at 15:39
  • I have only one analyzer definition, so the query analyzer is the same. I even confirmed that when I remove MorfologikFilterFactory in this one place, the `poznan->poznań` matching works fine. – Speedstone Nov 25 '16 at 08:27

1 Answer

There's a simple implementation of my workaround idea: make sure the stemmer receives each token along with its ASCII-folded form. This can be done by adding another ASCIIFoldingFilterFactory with preserveOriginal="true" before the stemmer:

    <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
    <filter class="solr.MorfologikFilterFactory" />
    <filter class="solr.ASCIIFoldingFilterFactory"/>
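
For context, here is a rough sketch of how the three filters might sit inside a complete field type. The field type name `text_pl_folded` and the choice of `StandardTokenizerFactory` are my assumptions for illustration; the original schema isn't shown here, so adapt them to whatever you already have.

    <!-- Sketch only: the name "text_pl_folded" and the tokenizer are assumptions -->
    <fieldType name="text_pl_folded" class="solr.TextField" positionIncrementGap="100">
      <!-- One analyzer definition, so it applies at both index and query time -->
      <analyzer>
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <!-- Any other filters you already use (e.g. lowercasing) stay where they are -->
        <!-- Emit both the original token and its ASCII-folded form -->
        <filter class="solr.ASCIIFoldingFilterFactory" preserveOriginal="true"/>
        <!-- Stem whichever of the two forms the dictionary recognizes -->
        <filter class="solr.MorfologikFilterFactory"/>
        <!-- Fold any diacritics remaining in the stemmed output -->
        <filter class="solr.ASCIIFoldingFilterFactory"/>
      </analyzer>
    </fieldType>

As far as I can tell, poznań then goes into the stemmer as both poznań and poznan. The first is stemmed to poznanie, while the second is not recognized and passes through unchanged, so both poznanie and poznan end up in the index. A query for poznan also passes through unchanged and matches. The Analysis screen in the Solr admin UI is a good place to confirm the actual token output for your own schema.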
Speedstone