I have some items in my index (Solr 4.4) whose names look like Foobar 135g, where 135g is a weight. Searching for foobar or foobar 135 works, but when I search for the exact phrase foobar 135g, nothing is found.
I analysed the query in the "Analysis" panel of the Solr admin interface, and everything looks fine there: the fields are indexed correctly, the query is split correctly, and I get hits (indicated by the purple background on the tokens).
But there has to be an issue with the way I process the strings at index and/or query time. This is the field definition I'm using:
<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateWords="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="30"/>
    <filter class="solr.ReverseStringFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="30"/>
    <filter class="solr.ReverseStringFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateWords="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
I'm using the two ReverseStringFilterFactory filters together with the two EdgeNGramFilterFactory filters to be able to search for foob as well as for bar or obar (strings that appear at the end of an item name). At first I thought the problem had something to do with the WordDelimiterFilterFactory and its catenateWords option, but that option doesn't affect tokens with numbers in them (am I right?).
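To make the intent of the n-gram/reverse chain concrete, here is a rough Python simulation of it. This is only a sketch, not Lucene's code: the function names are made up, I assume the EdgeNGramFilterFactory works from the front of the token (its default), and token positions and offsets are ignored entirely.

```python
def edge_ngrams(token, min_gram=2, max_gram=30):
    """Front edge n-grams: all prefixes with length min_gram..max_gram."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_chain(token):
    """Simplified stand-in for EdgeNGram -> Reverse -> EdgeNGram -> Reverse."""
    grams = edge_ngrams(token)                  # prefixes of the token
    reversed_grams = [g[::-1] for g in grams]   # each prefix reversed
    nested = [p for g in reversed_grams for p in edge_ngrams(g)]
    return {g[::-1] for g in nested}            # reverse back to reading order


tokens = index_chain("foobar")
print(sorted(tokens, key=lambda t: (len(t), t)))
```

Under these assumptions the chain yields every substring of the token that is at least two characters long, which is why foob, bar and obar are all searchable.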
After reading the documentation (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters) I found generateNumberParts, which defaults to 1. This causes 135g to be split into 135 and g. But as long as the preserveOriginal option is enabled, 135g is also indexed as a whole string. The Analysis panel in the admin interface shows this as well.
Does anybody know what kind of filter, tokenizer... is causing this issue?
UPDATE
I've found out something interesting. When I debug the query for the search term 135g, I get the following debug output:
<lst name="debug">
  <str name="rawquerystring">name_texts:135g</str>
  <str name="querystring">name_texts:135g</str>
  <str name="parsedquery">MultiPhraseQuery(name_texts:"(135g 135) (g 135g)")</str>
  <str name="parsedquery_toString">name_texts:"(135g 135) (g 135g)"</str>
  <lst name="explain"/>
  <str name="QParser">LuceneQParser</str>
  ...
</lst>
I understand that, because of the solr.WordDelimiterFilterFactory mentioned earlier, the string gets split into these parts. But why is Solr converting it into a MultiPhraseQuery? I'm a little bit confused right now: I thought that every single token generated by the solr.WordDelimiterFilterFactory at query time would trigger a separate search (or at least an OR between the tokens).
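As far as I understand it, a MultiPhraseQuery treats each parenthesised group as a set of alternatives for one position and, with the default slop of 0, requires the groups to match at consecutive positions. The following Python sketch shows that matching rule only; it is not Lucene's implementation, the function name is made up, and both example documents are hypothetical because I don't know which positions my index really holds after the n-gram filters.

```python
def multi_phrase_matches(doc_positions, phrase):
    """doc_positions: {term: set of positions}.
    phrase: list of sets, one set of alternative terms per query position.
    Matches (slop 0) if some start p has an alternative at p + i for every slot i."""
    starts = set()
    for term in phrase[0]:
        starts |= doc_positions.get(term, set())
    for p in starts:
        if all(
            any(p + i in doc_positions.get(term, set()) for term in slot)
            for i, slot in enumerate(phrase)
        ):
            return True
    return False


# The parsed query from the debug output: "(135g 135) (g 135g)"
phrase = [{"135g", "135"}, {"g", "135g"}]

# Hypothetical document with "135" and "g" at consecutive positions.
adjacent = {"135g": {1}, "135": {1}, "g": {2}}
# Hypothetical document where every term sits at the same position.
stacked = {"135g": {1}, "135": {1}, "g": {1}}

print(multi_phrase_matches(adjacent, phrase))  # True
print(multi_phrase_matches(stacked, phrase))   # False
```

The point of the sketch is that this kind of query depends on term positions, which a plain OR between the tokens would not.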
Please, can someone clear this up for me? How can I avoid this?