I am trying to create an alphabetic browse of names (personal and institutional) using range queries that sort without regard to punctuation or capitalization. Even though the analysis tool in Solr suggests that punctuation in queries should be stripped out correctly, the presence of punctuation in the query still negatively affects the results.
from schema.xml:
<fieldType name="sort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="-" replacement=" "/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[\p{Punct}¿¡「」]" replacement=""/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
<field name="authorSort" type="sort" indexed="true" stored="true" multiValued="false" required="true"/>
from solrconfig.xml:
<requestHandler name="/authors" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">lucene</str>
    <str name="echoParams">explicit</str>
    <str name="fl">*</str>
    <str name="df">authorSort</str>
    <str name="sort">authorSort asc</str>
    <str name="rows">20</str>
    <str name="wt">ruby</str>
    <str name="indent">true</str>
  </lst>
</requestHandler>
My actual queries look like this:
http://myserver/solr/testCore/authors?q=["Search String" TO *]
When I search for q=["ACA" TO *], my top result is "ACA (Academy of Certified Archivists)", which is good. If I vary the capitalization used in "ACA", my results don't change, which is also good. If I search for the acronym with periods (q=["A.C.A." TO *]), I don't get the appropriate results at all, and my top hit is "A3 (Musical group)". In this case I suspect that it's sorting on the period rather than dropping it.
According to the analysis tool in Solr, both "ACA" and "A.C.A." should be rendered down to "aca" using the analyzer I've configured. I'm at a loss to explain why these two searches aren't effectively equivalent.
(If it makes any difference, the index-time analysis is effectively useless as my code is doing the same conversion before submitting the data to be indexed. There are reasons for that. So it's only the query-time analysis that is giving me grief.)
Edit: Here's a screenshot of how my analysis of "A.C.A." as a query should be working (according to the Solr analysis tool).
Added about four months later:
Since posting the question and not finding a resolution, I have switched to using a custom filter factory for the analysis. This gave me control over the analysis that would have been difficult or impossible given the provided filters. My first attempt had the same problem: the analysis worked in regular search but wasn't applied in range queries. This problem was resolved by adding implements MultiTermAwareComponent to my filter factory and overriding getMultiTermComponent(). I have no idea what this does for a field which is using the KeywordTokenizer and therefore never has multiple terms in a field value... but it did fix the problem. This was for Solr 4.2.
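A possible explanation, offered with some hedging: in Solr 4.x, "multi-term" appears to refer to queries that can expand to many index terms (wildcard, prefix, and range queries), not to field values that contain several tokens. For those queries Solr uses a separate multiterm analyzer, and when none is declared it derives one from the query analyzer, keeping only the components that implement MultiTermAwareComponent. If PatternReplaceCharFilterFactory is not among them in the version in use (an assumption, not verified here for 4.2), the punctuation stripping would be skipped for range endpoints, which would match the symptoms described above. A configuration-only alternative may be to declare the multiterm analyzer explicitly. The sketch below reuses the "sort" field type from the question and simply repeats its chain; it is untested and should be checked in the analysis tool and against real range queries.

```xml
<fieldType name="sort" class="solr.TextField" sortMissingLast="true" omitNorms="true">
  <!-- Existing chain from the question, used for index-time and ordinary query-time analysis -->
  <analyzer>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="-" replacement=" "/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[\p{Punct}¿¡「」]" replacement=""/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
  <!-- Assumption: an explicitly declared multiterm analyzer is applied as written to
       range, prefix, and wildcard query terms, instead of the automatically derived one.
       The chain must still emit exactly one token, hence KeywordTokenizerFactory. -->
  <analyzer type="multiterm">
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="-" replacement=" "/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="[\p{Punct}¿¡「」]" replacement=""/>
    <charFilter class="solr.PatternReplaceCharFilterFactory" pattern="\s+" replacement=" "/>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.TrimFilterFactory"/>
  </analyzer>
</fieldType>
```

If this behaves as assumed, q=["A.C.A." TO *] and q=["ACA" TO *] should both use "aca" as the lower bound. Note that punctuation-stripping filters in a multiterm analyzer would also remove wildcard characters such as * and ? from wildcard queries on this field, so this approach suits a field that is only ever queried by range.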