
I want to retrieve results that match edge-n-grammed tokens. This works as expected for tokens that do not share a prefix, but not for tokens that do. For example, if the indexed term is bird box and the query is bird b, Solr returns results that contain only bird, possibly followed by results that contain bird box (our index is huge, so I haven't verified that yet).

Query Construction

titlePhrasalFielName:"bird b"~2

Solr Version - 7.7.1

Link to Analyzer Chain's Response

Here is my analyzer chain.

    <fieldType name="payloadPhrasal" class="solr.TextField">
      <analyzer type="index">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms_payload.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="30"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
      <similarity class="com.apple.its.uss.solrcomponents.PayloadSimilarity"/>
    </fieldType>
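
To reason about why a title containing only bird can match the query, here is a rough Python sketch of the index-time chain and of sloppy phrase matching. It is a simplified model, not Lucene's actual scorer: synonym expansion and payloads are skipped, the function names `index_tokens` and `sloppy_match` are made up for illustration, and it assumes every edge n-gram is emitted at the position of its source token.

```python
from itertools import product
from typing import List, Tuple


def index_tokens(text: str, min_gram: int = 1, max_gram: int = 30) -> List[Tuple[str, int]]:
    """Approximate the index analyzer: whitespace split, lowercase, edge n-grams.

    Assumption: each gram keeps the position of the token it came from.
    The synonym filter is left out of this model.
    """
    tokens = []
    for position, word in enumerate(text.lower().split()):
        for size in range(min_gram, min(max_gram, len(word)) + 1):
            tokens.append((word[:size], position))
    return tokens


def sloppy_match(tokens: List[Tuple[str, int]], phrase: List[str], slop: int) -> bool:
    """Simplified sloppy phrase check (hypothetical helper, not Lucene code).

    Pick one indexed position per query term, subtract the term's offset in
    the phrase, and accept if the spread of the shifted positions is <= slop.
    """
    positions = [[p for term, p in tokens if term == wanted] for wanted in phrase]
    if any(not candidates for candidates in positions):
        return False
    for combo in product(*positions):
        shifted = [p - offset for offset, p in enumerate(combo)]
        if max(shifted) - min(shifted) <= slop:
            return True
    return False


query = ["bird", "b"]  # query analyzer: whitespace split + lowercase only
for title in ("Bird", "Bird Box"):
    for slop in (0, 2):
        print(title, slop, sloppy_match(index_tokens(title), query, slop))
```

In this model the title Bird produces the token b at the same position as bird, so `"bird b"~2` matches it with a spread of 1, while an exact phrase (slop 0) matches only Bird Box. If real Solr behaves the same way, both documents match the sloppy query and the ordering comes down to scoring, which the `debug` output should confirm or refute.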

Any thoughts on how to make sure content with bird box is recalled for bird b ahead of content that contains only bird?

Note

  • I have already seen the Stack Overflow question that suggests indexing the tokens in a separate field. We do not want to follow that approach: the index could grow too large, and our token length can be literally anything.
  • We just moved from Solr 4.10 to Solr 7.7.1; the behaviour is the same in both versions. We haven't explored Solr 8 for this use case.
  • We do not want to store all the space-squashed prefixes in a multivalued field (e.g. b, bi, bir, bird, birdb, birdbo and birdbox), as this results in over-recall for some very common use cases. It breaks more cases than it solves.
user3440050
  • Try the ComplexPhraseQueryParser with `inOrder=true` as my initial guess is that the `~2` may be what's giving you the results that shouldn't be included. It'd be helpful with the output from `debug=all` as well to see why `bird` is recalled above `bird box`. – MatsLindh Jul 30 '19 at 19:20
  • No luck, tried it. We need the `~2` to take care of synonyms (not added as part of solr analyzer but some other sources). – user3440050 Aug 01 '19 at 06:02
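
For reference, the request suggested in the comments above could be built roughly as follows. This is a minimal sketch: the host, port and core name (`localhost:8983`, `mycore`) and the helper name `build_debug_params` are placeholders, and the field name is copied from the question as written.

```python
from urllib.parse import urlencode


def build_debug_params(field: str, phrase: str, slop: int, in_order: bool = True) -> str:
    """Build query-string parameters for a complexphrase query with debug output."""
    # Local-params syntax selects the ComplexPhraseQueryParser and its inOrder flag.
    q = '{!complexphrase inOrder=%s}%s:"%s"~%d' % (
        str(in_order).lower(), field, phrase, slop)
    # debug=all asks Solr to include the parsed query and score explanations.
    return urlencode({"q": q, "debug": "all", "fl": "*,score"})


params = build_debug_params("titlePhrasalFielName", "bird b", 2)
# Hypothetical endpoint; replace the host and core with your own.
print("http://localhost:8983/solr/mycore/select?" + params)
```

The `explain` section of the debug response is where the per-document score breakdown appears, which is what shows why one document outranks another.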
