We use Solr 5.4 and have some text fields defined as text_de, with the following definition in schema.xml:

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_de.txt" format="snowball" ignoreCase="true"/>
      <filter class="solr.GermanNormalizationFilterFactory"/>
      <filter class="solr.GermanLightStemFilterFactory"/>
    </analyzer>
</fieldType>

This is the default configuration. I wonder why a search for name:Rosewein returns no results, while name:Roséwein returns the related entries. So I queried the name field with a number of special characters and enabled the debugQuery option, which gives:

{
  "responseHeader": {
    "status": 0,
    "QTime": 0,
    "params": {
      "debugQuery": "true",
      "indent": "true",
      "q": "name:ÁÀÂÄÃåĀĂÆæöüßéèêíóú",
      "_": "1459935371889",
      "wt": "json"
    }
  },
  "response": {
    "numFound": 0,
    "start": 0,
    "docs": []
  },
  "debug": {
    "rawquerystring": "name:ÁÀÂÄÃåĀĂÆæöüßéèêíóú",
    "querystring": "name:ÁÀÂÄÃåĀĂÆæöüßéèêíóú",
    "parsedquery": "name:aaaaãåāăææousséèêiou",
    "parsedquery_toString": "name:aaaaãåāăææousséèêiou",
    "explain": {},
    "QParser": "LuceneQParser",
...

Have a look at the parsedquery field, which shows that not all variants are replaced with their ASCII representation. I cannot use ASCIIFoldingFilterFactory as a filter, because German umlauts can then get lost, since in some cases they are converted from ü to ue, and so on.

But what I can't understand is this: why are í, ó, ú and á converted to i, o, u and a, while é is kept as é?

And: is there a way to convert all these special vowels to their ASCII representation while still allowing umlauts to be converted to ae, Ae, ue, Ue and so on (without having to recompile Solr)?

rabudde
  • Can you try German2 with SnowballPorterFilterFactory...? – Abhijit Bashetti Apr 06 '16 at 10:43
  • According to https://lucene.apache.org/core/5_4_0/analyzers-common/index.html?org/apache/lucene/analysis/de/GermanNormalizationFilter.html, shouldn't Solr be using German2 already? – rabudde Apr 06 '16 at 11:28

1 Answer

If you are looking for custom character mapping rules, you can use MappingCharFilterFactory, which takes a config file containing the rules. The techproducts example schema showcases it (it is commented out there, so it disappears after the first schema modification). Check mapping-FoldToASCII.txt and mapping-ISOLatin1Accent.txt.
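
As a rough sketch of how this could look for the text_de field type from the question: a char filter runs on the raw text before the tokenizer, so accented vowels can be folded there, while umlauts and ß are left to the German filters further down the chain. The file name mapping-de-accents.txt and the rules listed in the comment are assumptions made for illustration; that file is not shipped with Solr and would have to be created in the core's conf directory.

```xml
<!-- Hypothetical file conf/mapping-de-accents.txt (an assumption, not part of Solr).
     One rule per line in the form "source" => "target", for example:
       "é" => "e"
       "è" => "e"
       "ê" => "e"
       "É" => "E"
       "ã" => "a"
       "å" => "a"
     Umlauts and ß are intentionally not listed, so the German filters below still handle them.
     The mapping runs before LowerCaseFilterFactory, so upper-case letters need their own rules. -->
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <!-- Char filters must be declared before the tokenizer -->
      <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-de-accents.txt"/>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
      <filter class="solr.StopFilterFactory" words="lang/stopwords_de.txt" format="snowball" ignoreCase="true"/>
      <filter class="solr.GermanNormalizationFilterFactory"/>
      <filter class="solr.GermanLightStemFilterFactory"/>
    </analyzer>
</fieldType>
```

Because a single analyzer element applies to both indexing and querying, existing documents would need to be reindexed after such a change. The Analysis screen in the Solr admin UI can be used to check how a value such as Roséwein passes through each stage.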

Alexandre Rafalovitch