
I have read various threads about how to remove accents at index/query time. The field type I have currently come up with looks like the following:

<fieldType name="text_general" class="solr.TextField">     
    <analyzer>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.ASCIIFoldingFilterFactory"/>
            <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>     
</fieldType>

After adding a couple of test documents to the index, I checked via http://localhost:8080/solr/test_core/admin/luke?fl=title

which kinds of tokens had been generated. For instance, a title like "Bayern München" has been tokenized into:

<int name="bayern">1</int>
<int name="m">1</int>
<int name="nchen">1</int>

So instead of the character being replaced by its ASCII counterpart, it seems to have been interpreted as a delimiter?! With an index like that I can search for neither "münchen" nor "m?nchen".

Any idea how to fix this? Thanks in advance.

user2148322

1 Answer


The issue is that you are applying the StandardTokenizerFactory before the ASCIIFoldingFilterFactory. Instead, you should apply the MappingCharFilterFactory character filter first and then the StandardTokenizerFactory.

As per the Solr Reference Guide, StandardTokenizerFactory supports the token types <ALPHANUM>, <NUM>, <SOUTHEAST_ASIAN>, <IDEOGRAPHIC>, and <HIRAGANA>. Therefore, when you tokenize with StandardTokenizerFactory the umlaut characters are lost, and the ASCIIFoldingFilterFactory is of no use after that.
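A quick way to see what each stage of the analysis chain actually produces is Solr's field analysis handler (assuming the default /analysis/field handler from the stock solrconfig.xml is still registered in your core):

http://localhost:8080/solr/test_core/analysis/field?analysis.fieldtype=text_general&analysis.fieldvalue=Bayern%20M%C3%BCnchen

The response lists the tokens after the tokenizer and after each filter, so you can see exactly at which stage "München" falls apart.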

Your fieldType should look like the one below if you want to stick with StandardTokenizerFactory.

<fieldType name="text_general" class="solr.TextField">     
    <analyzer>
            <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
            <tokenizer class="solr.StandardTokenizerFactory"/>
            <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>     
</fieldType>

The mapping-ISOLatin1Accent.txt file should contain the mappings for such "special" characters, e.g. ü -> ue, ä -> ae, etc. In Solr this file comes pre-populated by default.
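If you want to check or extend those mappings, the file uses one rule per line in the MappingCharFilterFactory format; a small excerpt in that format could look like this (the exact pre-populated contents depend on your Solr version):

# fold German umlauts to their two-letter transcriptions
"ü" => "ue"
"Ü" => "Ue"
"ä" => "ae"
"Ä" => "Ae"
# Unicode escapes are also allowed
"\u00F6" => "oe"

Note that the char filter runs before tokenization, so the tokenizer only ever sees the already-folded text.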

JHS
  • Thanks for answering. However, the MappingCharFilterFactory doesn't seem to get applied. I update the index via post.jar, sending a JSON file with the content to be added. After replacing the StandardTokenizerFactory with a WhitespaceTokenizerFactory, the strings are no longer split at the accents, but the accented characters also aren't replaced according to mapping-ISOLatin1Accent.txt. – user2148322 Jun 18 '13 at 07:25
  • If you are using WhitespaceTokenizerFactory, then you can use ASCIIFoldingFilterFactory: take the fieldType from your question and just replace StandardTokenizerFactory with WhitespaceTokenizerFactory (see the sketch after these comments). – JHS Jun 18 '13 at 07:40
  • I have applied two different field types, one to the title field and one to the content_type field, using MappingCharFilterFactory on the one hand and ASCIIFoldingFilterFactory on the other. Neither variation is working; http://localhost:8080/solr/test_core/admin/luke?fl=title,content_type still shows the split tokens, both in the content_type field and in the title field (the one with the lowercase filter). – user2148322 Jun 18 '13 at 07:51
  • My guess would have been that it fails because the index is updated via post.jar without an appropriate encoding. That's why I added the -Dfile.encoding parameter, but that hasn't fixed the problem either: java -Durl=http://localhost:8080/solr/test_core/update -Dtype=application/json -Dfile.encoding=UTF-8 -jar post.jar *.json – user2148322 Jun 18 '13 at 08:01
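For reference, here is a minimal sketch of the variant suggested in the comments above: WhitespaceTokenizerFactory in place of StandardTokenizerFactory, with ASCIIFoldingFilterFactory kept as a token filter (the rest of the configuration is assumed unchanged from the question):

<fieldType name="text_general" class="solr.TextField">
    <analyzer>
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.ASCIIFoldingFilterFactory"/>
            <filter class="solr.LowerCaseFilterFactory" />
    </analyzer>
</fieldType>

Since the whitespace tokenizer splits only on whitespace, "München" survives tokenization intact and the folding filter can then map ü to u.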