I am trying to create an alphabetic browse of names (personal and institutional) using range queries that sort without regard to punctuation or capitalization. The analysis tool in Solr suggests that punctuation in queries should be stripped out correctly, but the presence of punctuation in the query still negatively affects the results.

from schema.xml:

<fieldType name="sort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="-" replacement=" "/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[\p{Punct}¿¡「」]" replacement=""/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory" />
    <filter class="solr.TrimFilterFactory" />
  </analyzer>
</fieldType>

<field name="authorSort" type="sort" indexed="true" stored="true" multiValued="false" required="true"/>

from solrconfig.xml:

<requestHandler name="/authors" class="solr.SearchHandler">
<lst name="defaults">
  <str name="defType">lucene</str>
  <str name="echoParams">explicit</str>
  <str name="fl">*</str>
  <str name="df">authorSort</str>
  <str name="sort">authorSort asc</str>
  <str name="rows">20</str>
  <str name="wt">ruby</str>
  <str name="indent">true</str>
</lst>
</requestHandler>

My actual queries look like this:

http://myserver/solr/testCore/authors?q=["Search String" TO *]

When I search for q=["ACA" TO *], my top result is "ACA (Academy of Certified Archivists)", which is good. If I vary the capitalization used in "ACA" my results don't change, which is also good. If I search for the acronym with periods (q=["A.C.A." TO *]) I don't get the appropriate results at all, and my top hit is "A3 (Musical group)". In this case I suspect that it's sorting on the period rather than dropping it.

According to the analysis tool in Solr, both "ACA" and "A.C.A." should be rendered down to "aca" using the analyzer I've configured. I'm at a loss to explain why these two searches aren't effectively equivalent.

(If it makes any difference, the index-time analysis is effectively useless as my code is doing the same conversion before submitting the data to be indexed. There are reasons for that. So it's only the query-time analysis that is giving me grief.)

Edit: Here's a screenshot of how my analysis of "A.C.A." as a query should be working (according to the Solr analysis tool).

Added about four months later:

Since posting the question and not finding a resolution, I have switched to using a custom filter factory for the analysis. This gave me control over the analysis that would have been difficult or impossible with the provided filters. My first attempt had the same problem: the analysis worked in regular searches but wasn't applied in range queries. The problem was resolved by adding `implements MultiTermAwareComponent` to my filter factory and overriding `getMultiTermComponent()`. I have no idea what this does for a field that uses the `KeywordTokenizer` and therefore never has multiple terms in a field value, but it did fix the problem. This was for Solr 4.2.
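
For anyone who would rather stay with configuration than write a custom factory, the sketch below shows a variant of the field type above that I have not tested. It rests on an assumption: as far as I understand it, when a field type declares no multiterm analyzer, Solr derives one from the query analyzer and keeps only the components that are multi-term aware. That would explain the behavior if `PatternReplaceCharFilterFactory` is not one of them in 4.2, which I have not verified. Declaring `<analyzer type="multiterm">` explicitly is meant to make range, prefix and wildcard terms go through the full chain. The chain itself is copied unchanged from my schema.xml.

```xml
<fieldType name="sort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <!-- Unchanged chain: used for indexing and for ordinary (non-range) queries -->
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="-" replacement=" "/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[\p{Punct}¿¡「」]" replacement=""/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
  <!-- Assumption: an explicitly declared multiterm analyzer is used as written
       for range, prefix and wildcard terms, instead of an automatically derived one -->
  <analyzer type="multiterm">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="-" replacement=" "/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[\p{Punct}¿¡「」]" replacement=""/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
```

If this works as assumed, q=["A.C.A." TO *] and q=["ACA" TO *] should both be compared against the index as "aca". The multiterm chain must emit exactly one token per term, which the `KeywordTokenizerFactory` should guarantee here.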

frances
  • I don't see where you are expecting the periods to be removed in your analyzer chain. It is not clear what the second pattern replace filter is doing. Try entering `"A.C.A."` with the double quotes in the analysis tool and see what it outputs. – arun Jan 14 '15 at 00:11
  • The second pattern replace filter is supposed to be replacing anything that matches the Java character class of "Punct" (plus a few other punctuation characters that have given me trouble in the past) with the empty string "". I may not have explained it very clearly, but I did run "A.C.A." through the analysis tool and it did turn into "aca". And it is the second pattern replace filter where the periods disappeared. I'll see if I can add a screenshot of the analysis output to the question. – frances Jan 14 '15 at 14:30
  • Strange that PRCF is giving you those empty tokens (Greek upper-case phi) and the token position starts at 0 and ends at 6, and not at 3 as you would expect. – arun Jan 14 '15 at 18:51
  • I don't know what the strange display for the two PRCF entries means, and I suspect that it's part of the problem I'm having. I think the start and end positions are normal, though. I checked with a standard text field and they seem to reflect the character offsets in the original string. – frances Jan 14 '15 at 19:24
  • I tried simplifying the pattern in the second PRCF from `[\p{Punct}¿¡「」]` to `\p{Punct}` and finally to `\.` in the hope that the regex complexity was the problem, but it didn't help and simply reduced the types of punctuation that appear to be filtered in the analysis tool. – frances Jan 14 '15 at 19:48
  • I've run into the same problem. The analyzer is being ignored for range queries. – Charles May 20 '15 at 15:40
  • @Charles - I added a postscript to the question that explains what I did to get around the issue, though I never actually solved the original problem. – frances May 27 '15 at 14:43

0 Answers