To achieve some degree of fault tolerance with Solr, I have started to use the NGramFilterFactory. Here are the interesting bits from the schema.xml:
<field name="text" type="text" indexed="true" stored="true"/>
<copyField source="text" dest="text_ngram" />
<field name="text_ngram" type="text_ngram" indexed="true" stored="false"/>
<fieldType name="text_ngram" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.NGramFilterFactory" minGramSize="3" maxGramSize="3" />
  </analyzer>
</fieldType>
I am using the EDisMax query parser with pretty much the stock configuration. Here are the interesting lines from the solrconfig.xml:
<requestHandler name="/browse" class="solr.SearchHandler">
  <lst name="defaults">
    <!-- Query settings -->
    <str name="defType">edismax</str>
    <str name="qf">
      name name_ngram^0.001
    </str>
    <str name="mm">100%</str>
    <str name="q.op">AND</str>
    ...
This works, but it gives me lots of irrelevant results. Using Solr's analysis tools, I think I've tracked the issue down to the following cause:
The query is broken down into NGrams. Solr then searches for either the tokenized query in the text field or one of the NGrams in the text_ngram field. Using debug=query prints the following parsedquery when searching for "something":
(+DisjunctionMaxQuery(((text_ngram:som text_ngram:ome text_ngram:met text_ngram:eth text_ngram:thi text_ngram:hin text_ngram:ing) | text:something)))/no_coord
If I read this right, it means that either
- one of the NGrams needs to match, or
- the original query (tokenized) needs to match.
Now this will also find items like "ethernet", because one of the NGrams (eth) is the same.
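To make the overlap concrete, here is a rough Python sketch of what I believe the trigram filter does to each term. The `ngrams` helper is my own illustration, not Solr code, and it assumes plain character trigrams over the whole keyword token:

```python
def ngrams(text, n=3):
    """Return all character n-grams of length n, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


# Trigrams of the query term, matching the parsedquery above.
query_grams = ngrams("something")

# Trigrams of an unrelated indexed term.
doc_grams = ngrams("ethernet")

# Any shared trigram is enough to satisfy the OR-style n-gram clause.
shared = sorted(set(query_grams) & set(doc_grams))

print(query_grams)
print(shared)
```

A single shared trigram is all it takes for "ethernet" to be returned for "something".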
My question is: How can I set a higher threshold for the NGram matches? Is there a way to say "only return the item if at least 90% of the NGrams from the query match"? Making sure that 100% of the NGrams match would not make sense as this would effectively kill the fault tolerance.
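To spell out the kind of threshold I have in mind, here is a small Python sketch of the matching rule, outside of Solr. The names `match_ratio` and `passes` are hypothetical, and the logic is only my assumption of how such a "minimum share of NGrams" check could behave:

```python
def ngrams(text, n=3):
    """Return all character n-grams of length n, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def match_ratio(query, doc, n=3):
    """Fraction of the query's distinct n-grams that also occur in doc."""
    query_grams = set(ngrams(query, n))
    if not query_grams:
        return 0.0
    doc_grams = set(ngrams(doc, n))
    return len(query_grams & doc_grams) / len(query_grams)


def passes(query, doc, threshold, n=3):
    """True if at least `threshold` (0..1) of the query n-grams match."""
    return match_ratio(query, doc, n) >= threshold


print(match_ratio("something", "ethernet"))   # 1 of 7 trigrams match
print(match_ratio("somethign", "something"))  # 5 of 7 trigrams match
```

Note that a single transposition already drops the ratio to 5/7 for a nine-letter word, so the threshold would probably have to be tuned rather than fixed at 90%.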
Another way I thought of was to return only results whose score is above a certain threshold relative to the top result. This is because the item "something" will have a very high relevancy compared to "ethernet". So is there a way to hook into Solr to return only results that have, e.g., at least 1/100th of the score of the top result? I read that there is a way to provide a custom HitCollector, but I couldn't really find any info on this.
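For illustration, this is the behaviour I am after, sketched in Python as a client-side filter over an already returned result list. It is not a Solr hook; the function name `filter_by_relative_score` and the scores below are made up purely for the example:

```python
def filter_by_relative_score(results, min_fraction=0.01):
    """Keep only (doc, score) pairs scoring at least min_fraction of the top score."""
    if not results:
        return []
    top_score = max(score for _, score in results)
    cutoff = top_score * min_fraction
    return [(doc, score) for doc, score in results if score >= cutoff]


# Made-up scores, only to show the cutoff; real Solr scores will differ.
results = [("something", 12.0), ("somethign", 8.5), ("ethernet", 0.05)]

kept = filter_by_relative_score(results)
print(kept)
```

Doing this on the client would break paging and result counts, which is why I would prefer to apply the cutoff inside Solr.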
Thanks!