Solr: Can't search for numbers mixed with characters

Question

I have some items in my index (Solr 4.4) with names like Foobar 135g, where 135g refers to a weight. Searching for foobar or foobar 135 works, but when I search for the exact phrase foobar 135g, nothing is found.

I analysed the query in the "Analysis" section of the Solr admin panel. Everything looks good there: the fields are indexed correctly, the query is split correctly, and I get hits (indicated by the purple background on the tokens).

But there has to be an issue with the way I process the strings at index and/or query time. This is the field definition I'm using:

<fieldType name="text" class="solr.TextField" omitNorms="false">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateWords="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="30"/>
    <filter class="solr.ReverseStringFilterFactory" />
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="30"/>
    <filter class="solr.ReverseStringFilterFactory" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" catenateWords="1" catenateAll="1" preserveOriginal="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

I'm using the two ReverseStringFilterFactory filters together with the two EdgeNGramFilterFactory filters so that I can search for foob as well as for bar or obar (strings that appear at the end of an item name). At first I thought the problem had something to do with the WordDelimiterFilterFactory and its catenateWords option. But that option doesn't affect tokens that contain numbers (am I right?).

After reading the documentation (http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters), I found generateNumberParts, whose default is 1. This leads to 135g being split into 135 and g. But as long as the preserveOriginal option is enabled, 135g is also indexed as a whole string. The Analysis panel of the admin interface shows this as well:

[Image: Analysis panel, Solr admin interface: WDF (WordDelimiterFilterFactory)]

Does anybody know what kind of filter, tokenizer... is causing this issue?

UPDATE

I've found out something interesting. When I debug the query for the search 135g, I get the following debug output:

<lst name="debug">
  <str name="rawquerystring">name_texts:135g</str>
  <str name="querystring">name_texts:135g</str>
  <str name="parsedquery">MultiPhraseQuery(name_texts:"(135g 135) (g 135g)")</str>
  <str name="parsedquery_toString">name_texts:"(135g 135) (g 135g)"</str>
  <lst name="explain"/>
  <str name="QParser">LuceneQParser</str>
  ...
</lst>

I understand that, because of the solr.WordDelimiterFilterFactory mentioned earlier, the string gets split into these parts. But why is Solr converting it into a MultiPhraseQuery? I'm a little bit confused right now; I thought that every single token generated by the solr.WordDelimiterFilterFactory at query time would trigger a separate search (or at least an OR between the tokens).

Please, someone clear up my mind, I'm kinda confused ;) How can I avoid this?

Arun · Accepted Answer · 2014-01-06T13:43:58.893

It is the WordDelimiterFilterFactory. You should be able to see it in your admin panel under Analysis. To prevent the split, add the attribute splitOnNumerics="0" to the filter.
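As a minimal sketch of that change, the WordDelimiterFilterFactory line from the question could look like this with splitOnNumerics turned off. The other attributes are copied from the question; using the same line in both the index and the query analyzer is an assumption, not something stated above.

```xml
<!-- Sketch only: the question's WordDelimiterFilterFactory line with splitOnNumerics disabled. -->
<!-- Assumption: this replaces the existing WDF line in both the index and the query analyzer. -->
<filter class="solr.WordDelimiterFilterFactory"
        catenateWords="1"
        catenateAll="1"
        preserveOriginal="1"
        splitOnNumerics="0"/>
```

With this setting, 135g should no longer be split into 135 and g at the number/letter boundary. A change to the index-time analyzer only affects documents indexed afterwards, so existing documents have to be reindexed before the effect is visible.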

Update:

Read more about it here: http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters.

solr.WordDelimiterFilterFactory

Creates solr.analysis.WordDelimiterFilter.

Splits words into subwords and performs optional transformations on subword groups. By default, words are split into subwords with the following rules:

splitOnNumerics="1" causes alphabet => number transitions to generate a new part [Solr 1.3]: "j2se" => "j" "2" "se" default is true ("1"); set to 0 to turn off

Update 2

Based on your latest comment, I now understand what you meant. I took your field type definition, indexed your sentence on Solr 4.5.1, and was able to search for test_mytext:"foobar 135g", test_mytext:foobar 135g, test_mytext:foobar, test_mytext:135g, and test_mytext:135, where test_mytext is of the type you defined in your question above. So I do not know why you are unable to find it in your own index. Make sure your field is defined something like this: <field name="text" type="mytext" indexed="true" stored="true"/>

Update 3

Here is my debug log with your field definition. I'm not sure why you are seeing completely different processing.

Query => test_mytext:135g

"debug": {
  "rawquerystring": "test_mytext:135g",
  "querystring": "test_mytext:135g",
  "parsedquery": "test_mytext:135g test_mytext:135 test_mytext:g test_mytext:135g",
  "parsedquery_toString": "test_mytext:135g test_mytext:135 test_mytext:g test_mytext:135g",
  "explain": {
    "200": "\n0.8563627 = (MATCH) product of:\n 1.141817 = (MATCH) sum of:\n 0.35407978 = (MATCH) weight(test_mytext:135g in 1) [DefaultSimilarity], result of:\n 0.35407978 = score(doc=1,freq=2.0 = termFreq=2.0\n), product of:\n 0.45980635 = queryWeight, product of:\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.13194223 = queryNorm\n 0.77006286 = fieldWeight in 1, product of:\n 1.4142135 = tf(freq=2.0), with freq of:\n 2.0 = termFreq=2.0\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.15625 = fieldNorm(doc=1)\n 0.4336574 = (MATCH) weight(test_mytext:135 in 1) [DefaultSimilarity], result of:\n 0.4336574 = score(doc=1,freq=3.0 = termFreq=3.0\n), product of:\n 0.45980635 = queryWeight, product of:\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.13194223 = queryNorm\n 0.94313055 = fieldWeight in 1, product of:\n 1.7320508 = tf(freq=3.0), with freq of:\n 3.0 = termFreq=3.0\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.15625 = fieldNorm(doc=1)\n 0.35407978 = (MATCH) weight(test_mytext:135g in 1) [DefaultSimilarity], result of:\n 0.35407978 = score(doc=1,freq=2.0 = termFreq=2.0\n), product of:\n 0.45980635 = queryWeight, product of:\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.13194223 = queryNorm\n 0.77006286 = fieldWeight in 1, product of:\n 1.4142135 = tf(freq=2.0), with freq of:\n 2.0 = termFreq=2.0\n 3.4849067 = idf(docFreq=2, maxDocs=36)\n 0.15625 = fieldNorm(doc=1)\n 0.75 = coord(3/4)\n"
  },

I am using Solr 4.5.1.
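One possible explanation for the different parsed queries, offered as an assumption and not confirmed anywhere in this thread: solr.TextField accepts an autoGeneratePhraseQueries attribute, and its default depends on the version attribute of schema.xml (true for schema versions below 1.4, false from 1.4 on). When it is enabled, the tokens that the query analyzer produces from a single unquoted chunk such as 135g are combined into a phrase query instead of being ORed together. A sketch of setting it explicitly on the field type from the question:

```xml
<!-- Sketch only: the question's field type with phrase auto-generation switched off. -->
<!-- Assumption: autoGeneratePhraseQueries is what differs between the two setups. -->
<fieldType name="text" class="solr.TextField" omitNorms="false"
           autoGeneratePhraseQueries="false">
  <!-- index and query analyzers unchanged from the question -->
</fieldType>
```

If the parsed query for name_texts:135g still shows a MultiPhraseQuery after this change, this attribute is not the cause.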

Update 4

Then I noticed that you are using Solr 4.4.0. I took your exact field definition and phrase and ran a query on that version, and it finds your result.

Query => name_texts:"135g"

Result:

<result name="response" numFound="1" start="0">
  <doc>
    <str name="id">100</str>
    <str name="name_texts">Foobar 135g</str>
    <long name="_version_">1456487722571005952</long>
  </doc>
</result>
<lst name="debug">
  <str name="rawquerystring">name_texts:"135g"</str>
  <str name="querystring">name_texts:"135g"</str>
  <str name="parsedquery">MultiPhraseQuery(name_texts:"(135g 135) (g 135g)")</str>
  <str name="parsedquery_toString">name_texts:"(135g 135) (g 135g)"</str>

Your processing looks correct, and it finds the result in my instance. I first thought you had an extra , but it does not seem to cause a problem in my local instance. The best way to investigate these issues is to use the admin Analysis page and debug queries, which you are already doing. I cannot think of anything else, as I am unable to reproduce the problem. Do yourself a favor: take a clean instance of Solr, change only schema.xml to add your field definition, and index just this document through the admin panel (Documents) => {"id":"100","name_texts":"Foobar 135g"}. Then run this query: http://localhost:8983/solr/collection1/select?q=name_texts%3A%22135g%22&wt=xml&indent=true&debugQuery=true
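For anyone who prefers to post the test document as XML instead of JSON, the same document could be written as a Solr XML update message. This is a sketch that assumes the clean instance's schema defines the id and name_texts fields used above.

```xml
<!-- Sketch only: XML update message equivalent to the JSON test document above. -->
<!-- Assumption: the schema defines the fields "id" and "name_texts". -->
<add>
  <doc>
    <field name="id">100</field>
    <field name="name_texts">Foobar 135g</field>
  </doc>
</add>
```

The document only becomes searchable after a commit, so the query above should be run once the update has been committed.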

can you explain a little bit more, what is going wrong with the `WordDelimiterFactory`? — 23tux, Jan 02 '14 at 14:24
cool, thanks! But I'm still wondering why I have a hit (see the image in my original question) when searching for `135g`: the phrase `135g` gets split into `135`, `g` and `135g`, so I thought it should match my query — 23tux, Jan 02 '14 at 14:29
thanks for your update. Is there something missing in **Update 2** at the end of the last sentence "...something like this:"? — 23tux, Jan 04 '14 at 14:14
And I'm still unable to search for the mentioned queries. Is there a way to analyse Solr's query in more depth? I tried the Analysis panel of the admin interface, but everything seems to work fine there. So, are there any other methods to debug the whole query? — 23tux, Jan 04 '14 at 14:15
