Solr - EDisMax - Match Sub-Phrase Exactly

Question

I am querying a simple core of category names, e.g.

JEANS
SKINNY JEANS
BOOT CUT JEANS
SHOES
...

I typically use EDisMax. I would like the user query, for example:

BLUE SKINNY JEANS

to match only exact categories. So in the above case only the following should match:

SKINNY JEANS
JEANS

I'm using Solr 5.3.1. I tried to implement the category "name" field as a string type, and I query with the following params:

"params": {
      "q": "SKINNY JEANS",
      "defType": "edismax",
      "indent": "true",
      "qf": "name",
      "pf": "name",
      "pf3": "name",
      "wt": "json",
      "pf2": "name",
      "lowercaseOperators": "true",
      "debugQuery": "true",
      "stopwords": "true",
      "_": "1464079436985"
    }

but only JEANS is ever matched. I cannot, for the life of me, get SKINNY JEANS to match.

I am getting more familiar with Solr's analysers, so I tried defining the following field type as a way to get around the problem:

<fieldType name="text_phrase" class="solr.TextField" positionIncrementGap="100">

      <analyzer type="index">
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>

      <analyzer type="query">
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.TrimFilterFactory" />
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>

    </fieldType>

That is, I use a KeywordTokenizerFactory to index the category name without tokenizing, and tokenize the query in conjunction with EDisMax's pf/pf2/pf3 fields. This does not work either. I don't think shingles are a solution here, and PositionFilterFactory appears to be deprecated.

How do I use EDisMax to match a category that is a sub-phrase of a longer query string?

Thank you,

score 4 · Answer 1 · answered May 25 '16 at 08:23

The pf/pf2/pf3 parameters only come into play to reorder the results. Documents must first match the main query. That means you cannot use these parameters to drop any results, only to promote the best results to the top. If you want to drop results first, you need to use other methods (e.g. an mm parameter). Unfortunately, that's a hard problem, as Solr does not know what the user means or which fields are 'compulsory' for that particular query. Some of this has been discussed in a series of articles by Ted Sullivan, specifically the ones about Query Autofiltering.

Additionally, the pf/pf2/pf3 parameters in your example are given as plain field names, without weights, which means they do not express any relative priorities. You would want to use something like this instead:

  "pf":"name^10",
  "pf3":"name^3",
  "pf2":"name^2",

When used correctly, you should see the phrases showing up in the debug output (with the debugQuery flag enabled):

"+((name:blue) (name:skinny) (name:jeans)) ((name:\"blue skinny jeans\")^10.0) (((name:\"blue skinny\")^2.0) ((name:\"skinny jeans\")^2.0)) ((name:\"blue skinny jeans\")^3.0)"

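As a rough illustration of the weighted parameters above, here is a minimal Python sketch that assembles such a request URL. The host, port, and core name (`categories`) are hypothetical placeholders, and the helper names `build_edismax_params` and `build_url` are made up for this example. It only builds the URL and does not contact Solr.

```python
from urllib.parse import urlencode

# Hypothetical endpoint: host, port and core name are assumptions for illustration.
SOLR_SELECT_URL = "http://localhost:8983/solr/categories/select"


def build_edismax_params(user_query):
    """Return eDisMax request parameters with weighted phrase fields."""
    return {
        "q": user_query,
        "defType": "edismax",
        "qf": "name",
        # Phrase boosts only reorder documents that already matched q/qf.
        "pf": "name^10",
        "pf3": "name^3",
        "pf2": "name^2",
        "debugQuery": "true",
        "wt": "json",
    }


def build_url(user_query):
    """Encode the parameters into a query string (the caret becomes %5E)."""
    return SOLR_SELECT_URL + "?" + urlencode(build_edismax_params(user_query))


if __name__ == "__main__":
    print(build_url("BLUE SKINNY JEANS"))
```

The resulting URL can be pasted into a browser or passed to any HTTP client. The weights are the ones suggested in this answer, not tuned values.
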
score 1 · Answer 2 · answered May 24 '16 at 09:27

Since your query side is tokenizing the input value, any query will be broken into separate tokens, which will then be matched against the indexed value.

In the case of 'SKINNY JEANS', this will be kept as one single token in the index (SKINNY JEANS), while when you're searching, the string is broken into separate tokens, so it's trying to match BLUE, SKINNY and JEANS. None of these tokens matches SKINNY JEANS (indexed as one single, large token).
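
The mismatch can be reproduced outside Solr with a small Python model of the two analyzer chains from the question. This is only a sketch: `index_analyze` and `query_analyze` are hypothetical stand-ins for the keyword and whitespace tokenizers (plus trim and lowercase), and matching is reduced to plain term equality.

```python
def index_analyze(value):
    """Mimic KeywordTokenizer + Trim + LowerCase: the whole value is one token."""
    return [value.strip().lower()]


def query_analyze(text):
    """Mimic WhitespaceTokenizer + LowerCase: one token per word."""
    return text.lower().split()


def matching_categories(query, categories):
    """Return categories whose single indexed token equals some query token."""
    query_tokens = set(query_analyze(query))
    return [c for c in categories if query_tokens & set(index_analyze(c))]


CATEGORIES = ["JEANS", "SKINNY JEANS", "BOOT CUT JEANS", "SHOES"]

# Only the one-word category can equal a one-word query token.
print(matching_categories("BLUE SKINNY JEANS", CATEGORIES))  # ['JEANS']
```

The indexed token "skinny jeans" contains a space, so no whitespace-split query token can ever equal it, which is consistent with only JEANS being returned.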

Shingles could work (at least better than your current solution), as that would end up with BLUE_SKINNY and SKINNY_JEANS as tokens, depending on your field configuration. Bear in mind that in all these cases a query such as JEANS SKINNY will not match SKINNY JEANS.

I'm guessing you can solve this with a shingle filter on the query side, and by inserting proper separators when indexing. The query would have BLUE, SKINNY, JEANS, BLUE_SKINNY, SKINNY_JEANS as the values to query for, while SKINNY_JEANS would be the indexed value, meaning you'll get a match. (The default separator is ' ', so you should be good to go by inserting the shingle filter as the last step in the query chain.)
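
To make the idea concrete, here is a small Python sketch of query-side shingling against keyword-indexed values. It models only the analysis step: the `shingles` helper is a hypothetical stand-in for a shingle filter that joins adjacent tokens with a single space, and it assumes the analyzer receives the whole query string in one piece. Whether eDisMax actually hands the full string to the analyzer depends on the Solr version and parser settings, so treat that as an assumption to verify.

```python
def shingles(tokens, min_size=2, max_size=3, separator=" "):
    """Join every run of min_size..max_size adjacent tokens with the separator."""
    out = []
    for size in range(min_size, max_size + 1):
        for start in range(len(tokens) - size + 1):
            out.append(separator.join(tokens[start:start + size]))
    return out


def query_terms(text, max_size=3):
    """Whitespace-split and lowercase the query, then add shingles to the unigrams."""
    tokens = text.lower().split()
    return tokens + shingles(tokens, 2, max_size)


def matching_categories(query, categories):
    """A category matches if its single keyword token equals any query term."""
    terms = set(query_terms(query))
    return [c for c in categories if c.strip().lower() in terms]


CATEGORIES = ["JEANS", "SKINNY JEANS", "BOOT CUT JEANS", "SHOES"]

print(matching_categories("BLUE SKINNY JEANS", CATEGORIES))  # ['JEANS', 'SKINNY JEANS']
print(matching_categories("JEANS SKINNY", CATEGORIES))       # ['JEANS']
```

Because the separator is a space, the shingle "skinny jeans" is identical to the keyword-indexed token, while the reversed query JEANS SKINNY produces "jeans skinny" and matches only the single-word category.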

If you look at the query parameters I pasted, you'll see that the pf/pf2/pf3 params are set, i.e. the query should implicitly shingle. Here is my query debug output. It appears that the query is being shingled, and yet it is not matching skinny jeans (all lowercase): "querystring": "skinny jeans", "parsedquery": "(+(DisjunctionMaxQuery((name:skinny)) DisjunctionMaxQuery((name:jeans))) DisjunctionMaxQuery((name:\"skinny jeans\")) DisjunctionMaxQuery((name:\"skinny jeans\")))/no_coord", "parsedquery_toString": "+((name:skinny) (name:jeans)) (name:\"skinny jeans\") (name:\"skinny jeans\")" - mils, May 24 '16 at 09:45
Thanks for the suggestion. I've spent all day looking at it and I think you're right, but I'm having some trouble with the implementation. Can you tell me more about proper separators? Is there an issue with using whitespace? Thanks - mils, May 25 '16 at 09:08
