Solr and hyphenated numbers

Question

I have a number with hyphens: 91-21-22020-4.

My problem is that I would like to get hits even if the hyphens are moved around within the number string. As it stands, 912122020-4 gives one hit but 91212202-04 gives none. Why?

The debug info looks like:

"debug": {
"rawquerystring": "91212202-04",
"querystring": "91212202-04",
"parsedquery": "+((freetext:91212202 freetext:9121220204)/no_coord) +freetext:04",
"parsedquery_toString": "+(freetext:91212202 freetext:9121220204) +freetext:04",
"explain": {},
"QParser": "LuceneQParser",

and, for the query that does match:

"debug": {
"rawquerystring": "912122020-4",
"querystring": "912122020-4",
"parsedquery": "+((freetext:912122020 freetext:9121220204)/no_coord) +freetext:4",
"parsedquery_toString": "+(freetext:912122020 freetext:9121220204) +freetext:4",
"explain": {
  "ATEST003-81419": "\n0.33174315 = (MATCH) sum of:\n  0.17618936 = (MATCH) sum of:\n    0.17618936 = (MATCH) weight(freetext:9121220204 in 0) [DefaultSimilarity], result of:\n      0.17618936 = score(doc=0,freq=1.0), product of:\n        0.5690552 = queryWeight, product of:\n          3.3025851 = idf(docFreq=1, maxDocs=20)\n          0.17230599 = queryNorm\n        0.30961734 = fieldWeight in 0, product of:\n          1.0 = tf(freq=1.0), with freq of:\n            1.0 = termFreq=1.0\n          3.3025851 = idf(docFreq=1, maxDocs=20)\n          0.09375 = fieldNorm(doc=0)\n  0.15555379 = (MATCH) weight(freetext:4 in 0) [DefaultSimilarity], result of:\n    0.15555379 = score(doc=0,freq=2.0), product of:\n      0.44962177 = queryWeight, product of:\n        2.609438 = idf(docFreq=3, maxDocs=20)\n        0.17230599 = queryNorm\n      0.34596586 = fieldWeight in 0, product of:\n        1.4142135 = tf(freq=2.0), with freq of:\n          2.0 = termFreq=2.0\n        2.609438 = idf(docFreq=3, maxDocs=20)\n        0.09375 = fieldNorm(doc=0)\n"
},

My schema.xml looks like:

<fieldType name="text_indexed" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.HyphenatedWordsFilterFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-index.txt"/>
        <filter class="solr.ReversedWildcardFilterFactory" withOriginal="true" maxPosAsterisk="3" maxPosQuestion="2" maxFractionAsterisk="0.33"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.WordDelimiterFilterFactory" catenateNumbers="1" catenateAll="0"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-index.txt"/>
    </analyzer>
</fieldType>

score 0 · Answer 1 · answered Feb 09 '16 at 11:50

Use a PatternReplaceCharFilter to strip the hyphens before the text is tokenized and indexed in Solr (or use a PatternReplaceFilter, which rewrites the tokens after tokenization instead of the incoming text).

91212202-04 would then be indexed (and searched) as 9121220204, which would effectively remove any dependency on hyphens.
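
As a rough sketch of the char filter approach, a field type could look something like the one below. The pattern `(\d)-(?=\d)` with replacement `$1` is my own assumption about what "a hyphen inside a number" should mean (a hyphen between two digits); the other filters from the question are left out only to keep the example small, not because they must go. Leaving out the `type` attribute on `<analyzer>` makes Solr use the same chain at index and query time.

```xml
<fieldType name="text_indexed" class="solr.TextField">
    <!-- One analyzer without a type attribute: applied at both index and query time -->
    <analyzer>
        <!-- Assumed pattern: drop a hyphen that sits between two digits,
             so 91-21-22020-4, 912122020-4 and 91212202-04 all become 9121220204 -->
        <charFilter class="solr.PatternReplaceCharFilterFactory"
                    pattern="(\d)-(?=\d)" replacement="$1"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
</fieldType>
```

With a chain like this the individual parts (91, 21, 22020, 4) are no longer indexed as separate tokens, so only the full number matches. Documents have to be reindexed after the index-time analysis changes.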

answered Feb 09 '16 at 11:50

MatsLindh


But when I run the analysis, 9121220204 is already what gets indexed and searched for? And it kind of works; I just can't figure out why the position of the hyphens matters for the search result. – user1245173 Feb 09 '16 at 12:32
Well, the "+freetext:4" at the end of your parsed query is what creates the requirement to match a token that is exactly 4 (or 04 for the other query). The document contains the token 4 but not 04, which is why one query hits and the other doesn't. Use the "analysis" page to see the transformation at each step. It's also a good idea to remove filters that you don't need (you have a hyphenated-words filter, a MappingCharFilter _and_ a WordDelimiterFilter, any of which might be changing the input) before debugging further. – MatsLindh Feb 09 '16 at 12:44
OK, that did the trick. – user1245173 Feb 09 '16 at 12:56
