I am using the German language analyzer to tokenize some content. I know that it is basically a macro filter for "lowercase", "german_stop", "german_keywords", "german_normalization" and "german_stemmer".
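For reference, my understanding from the docs is that the built-in analyzer can be rebuilt as a custom analyzer roughly like this (the index name and the keyword list are just placeholders):

```
PUT /my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "german_stop": {
          "type": "stop",
          "stopwords": "_german_"
        },
        "german_keywords": {
          "type": "keyword_marker",
          "keywords": ["Beispiel"]
        },
        "german_stemmer": {
          "type": "stemmer",
          "language": "light_german"
        }
      },
      "analyzer": {
        "rebuilt_german": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "german_stop",
            "german_keywords",
            "german_normalization",
            "german_stemmer"
          ]
        }
      }
    }
  }
}
```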
My problem has to do with the normalization filter. Here is the Elasticsearch documentation and the Lucene implementation of the filter. The problem is that ae, oe and ue are treated as the German umlauts ä, ö and ü and therefore transformed to a, o and u.
The second transformation is good, but the first leads to more problems than it solves. There is usually no ae, oe or ue in German texts that really represents ä, ö or ü. Most of the time they appear inside foreign words, derived from Latin or English, like 'Aerodynamik' (aerodynamics). The filter then interprets 'Ae' as 'Ä' and transforms it to 'A'. This yields 'arodynamik' as the token. Normally this is not a problem, since the search word is also normalized with that filter. It does however become a problem when combined with wildcard search:
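Here is roughly how I reproduce the behaviour with the _analyze API, using just the lowercase and normalization filters (request syntax may differ slightly depending on the version):

```
GET _analyze
{
  "tokenizer": "standard",
  "filter": ["lowercase", "german_normalization"],
  "text": "Aerodynamik"
}
```

The returned token is 'arodynamik'.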
Imagine a word like 'FooEdit': this will be tokenized to 'foodit'. A search for 'edit OR *edit*' (which is my normal search when the user searches for 'edit') will not yield a result, since the 'e' of 'edit' got lost. Since my content has a lot of words like that, and people search for partial words, it's not as much of an edge case as it seems.
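For context, the search I run is essentially a query_string query along these lines (index and field names are placeholders):

```
GET my_index/_search
{
  "query": {
    "query_string": {
      "query": "edit OR *edit*",
      "default_field": "content"
    }
  }
}
```

Since 'FooEdit' is indexed as 'foodit', neither the term 'edit' nor the wildcard '*edit*' matches it.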
So my question is: is there any way to get rid of the 'ae -> a' transformation? My understanding is that this is part of the German2 snowball algorithm, so it probably can't be changed. Does that mean I would have to get rid of the whole normalization step, or can I provide my own version of the snowball algorithm where I just strip out the parts I don't like? (I didn't find any documentation on how to use a custom snowball algorithm for normalization.)
Cheers
Tom