My Solr 4.1.0 installation does not find anything when I search a phonetically encoded field. Here are the relevant excerpts from schema.xml:
<field name="textsuggest" type="text_suggest" indexed="true" stored="true" omitNorms="true" multiValued="true" />
<field name="textphon" type="text_phonetic_do" indexed="true" stored="true" omitNorms="true" omitTermFreqAndPositions="false" multiValued="true" />
<copyField source="textsuggest" dest="textphon"/>
...
<fieldType name="text_phonetic_do" class="solr.TextField"
positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.StandardTokenizerFactory" />
<filter class="solr.GermanNormalizationFilterFactory" />
<filter class="solr.SynonymFilterFactory" synonyms="lang/synonyms_de.txt"
ignoreCase="true" expand="false" />
<filter class="solr.PhoneticFilterFactory" encoder="ColognePhonetic" inject="false" />
</analyzer>
</fieldType>
text_suggest is more or less a lowercased version of the original text, tokenized with solr.StandardTokenizerFactory and solr.WordDelimiterFilterFactory. The phonetic encoder is one specialized for German words. The synonym filter processes some domain-specific words.
I was inspired by http://www.cominvent.com/2012/01/25/super-flexible-autocomplete-with-solr/.
I index an entry with "Geprüfter Betriebswirt" and other items in textsuggest. When I search for "Betriebswirt", I get the expected results. However, when I search for "Betribswirt", which is just a minor misspelling of the original German word, Solr reports 0 hits.
In the analysis view of Solr's admin GUI I tried different spellings of "Betriebswirt" with my field type text_phonetic_do, and they all get encoded to the same digit sequence:
- betriebswirt => 12718372
- betribswirt => 12718372
- betribswiiirt => 12718372
- petribswiert => 12718372
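To double-check these codes outside Solr, here is a minimal Python sketch of the Kölner Phonetik (Cologne phonetics) rules. It is a simplified re-implementation for illustration only, not the code Solr runs (Solr's PhoneticFilterFactory delegates to Apache Commons Codec), and it skips the normalization and synonym filters that precede the encoder in text_phonetic_do. The function name cologne_phonetic is hypothetical.

```python
def cologne_phonetic(word: str) -> str:
    """Simplified Cologne phonetics; a sketch, not Solr's implementation."""
    # Fold umlauts and keep only A-Z ("ß".upper() already yields "SS").
    folded = word.upper().translate(str.maketrans({"Ä": "A", "Ö": "O", "Ü": "U"}))
    letters = [c for c in folded if "A" <= c <= "Z"]

    codes = []
    for i, ch in enumerate(letters):
        prev = letters[i - 1] if i > 0 else ""
        nxt = letters[i + 1] if i + 1 < len(letters) else ""
        if ch in "AEIJOUY":
            code = "0"
        elif ch == "H":
            continue  # H carries no code of its own
        elif ch == "B":
            code = "1"
        elif ch == "P":
            code = "3" if nxt == "H" else "1"
        elif ch in "DT":
            code = "8" if nxt in set("CSZ") else "2"
        elif ch in "FVW":
            code = "3"
        elif ch in "GKQ":
            code = "4"
        elif ch == "C":
            if i == 0:
                code = "4" if nxt in set("AHKLOQRUX") else "8"
            elif prev in set("SZ"):
                code = "8"
            else:
                code = "4" if nxt in set("AHKOQUX") else "8"
        elif ch == "X":
            code = "8" if prev in set("CKQ") else "48"
        elif ch == "L":
            code = "5"
        elif ch in "MN":
            code = "6"
        elif ch == "R":
            code = "7"
        else:  # S or Z
            code = "8"
        codes.append(code)

    # Collapse runs of identical digits, then drop every 0 except a leading one.
    collapsed = []
    for digit in "".join(codes):
        if not collapsed or collapsed[-1] != digit:
            collapsed.append(digit)
    result = "".join(collapsed)
    return result[:1] + result[1:].replace("0", "")


for spelling in ("betriebswirt", "betribswirt", "betribswiiirt", "petribswiert"):
    print(spelling, "=>", cologne_phonetic(spelling))
```

All four spellings come out as 12718372 with this sketch, which matches what the analysis view shows.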
So the encoding (both the index-time and the query-time analysis) works as expected. But as said above, Solr does not find any document when I search for a phonetic variant.
I use the query view, and even the query textphon:Betriebswirt doesn't return a single result. The full query result (I stripped the timing part) looks like this:
<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
<lst name="params">
<str name="debugQuery">true</str>
<str name="indent">true</str>
<str name="q">textphon:Betriebswirt</str>
<str name="wt">xml</str>
</lst>
</lst>
<result name="response" numFound="0" start="0">
</result>
<lst name="debug">
<str name="rawquerystring">textphon:Betriebswirt</str>
<str name="querystring">textphon:Betriebswirt</str>
<str name="parsedquery">textphon:12718372</str>
<str name="parsedquery_toString">textphon:12718372</str>
<lst name="explain"/>
<str name="QParser">LuceneQParser</str>
</lst>
</response>
I don't know why it doesn't find anything. If I understand the debug output correctly, the index is even searched for the right (read: phonetically encoded) token.
So what am I missing? Can anybody point me in the right direction? Thanks.