I want to retrieve results which match edge n-grammed tokens. This works as expected for tokens which do not share prefixes, but for tokens which do share prefixes, Solr doesn't behave as expected. Eg: if the indexed term is bird box and the query is bird b, Solr returns results which contain only bird, maybe with results containing bird box following them (our index is huge, so I haven't verified this yet).
Query Construction
titlePhrasalFielName:"bird b"~2
Solr Version - 7.7.1
Link to Analyzer Chain's Response
Here is my analyzer chain.
<fieldType name="payloadPhrasal" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms_payload.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="30"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <similarity class="com.apple.its.uss.solrcomponents.PayloadSimilarity"/>
</fieldType>
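To make the mismatch easier to see, here is a rough Python sketch of what I understand the two analyzers to emit. It is not Solr code: the function names are made up for illustration, the synonym filter is skipped, and it assumes every gram of a token is indexed at the same position as that token.

```python
def edge_ngrams(token, min_gram=1, max_gram=30):
    """Return the leading-edge n-grams of a token, shortest first."""
    upper = min(len(token), max_gram)
    return [token[:n] for n in range(min_gram, upper + 1)]


def index_tokens(text, min_gram=1, max_gram=30):
    """Mimic the index analyzer: whitespace split, lowercase, edge n-grams.

    Returns (position, gram) pairs. Assumption: all grams of one token
    share that token's position. Synonym expansion is not modelled.
    """
    pairs = []
    for position, token in enumerate(text.lower().split()):
        for gram in edge_ngrams(token, min_gram, max_gram):
            pairs.append((position, gram))
    return pairs


def query_tokens(text):
    """Mimic the query analyzer: whitespace split and lowercase only."""
    return text.lower().split()


if __name__ == "__main__":
    for position, gram in index_tokens("Bird Box"):
        print(position, gram)
    print(query_tokens("bird b"))
```

Under that assumption, bird box is indexed as b, bi, bir, bird at position 0 and b, bo, box at position 1. A document containing only bird would also carry b at position 0, which may be why the sloppy phrase query bird b can match it.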
Any thoughts on how to make sure content with bird box gets recalled for bird b before any other content which only has bird?
Note
- Already seen the Stack Overflow question about indexing the tokens in a different field. We do not want to follow that approach, as the index can grow too large and our token length can be literally anything.
- We just moved from Solr 4.10 to Solr 7.7.1, and the behaviour is the same in both versions. Haven't explored Solr 8 for this use case.
- Do not want to store all the prefixes (space squashed) in a multivalued field, eg: b, bi, bir, bird, birdb, birdbo and birdbox, as this results in over-recall for some very common use cases. Meaning it breaks more cases than it actually solves.