We have a specific use case for our Elasticsearch instance: we store documents which contain proper names, dates of birth, addresses, ID numbers, and other related info.

We use a name-matching plugin which overrides the default scoring of ES and assigns a relevancy score between 0 and 1 based on how closely the name matches.

What we need to do is boost that score by a certain amount if other fields match. I have started to read up on ES scripting to achieve this. I need assistance on the script part of the query. Right now, our query looks like this:

{
  "size": 100,
  "query": {
    "bool": {
      "should": [
        {"match": {"Name": "John Smith"}}
      ]
    }
  },
  "rescore": {
    "window_size": 100,
    "query": {
      "rescore_query": {
        "function_score": {
          "doc_score": {
            "fields": {
              "Name": {"query_value": "John Smith"},
              "DOB": {
                "function": {
                  "function_score": {
                    "script_score": {
                      "script": {
                        "lang": "painless",
                        "params": {
                          "query_value": "01-01-1999"
                        },
                        "inline": "if **<HERE'S WHERE I NEED ASSISTANCE>**"
                      }
                    }
                  }
                }
              }
            }
          }
        }
      },
      "query_weight": 0.0,
      "rescore_query_weight": 1.0
    }
  }
}

The Name field will always be required in a query and is the basis for the score, which is returned in the default _score field. For ease of demonstration, we'll add just one additional field, DOB, which, if matched, should boost the score by 0.1. I believe I'm looking for something along the lines of: if query_value == doc['DOB'].value, add 0.1 to _score.

So, what would be the correct syntax to be entered into the inline row to achieve this? Or, if the query requires other syntax revision, please advise.

EDIT #1 - it's important to highlight that our DOB field is a text field, not a date field.

Stpete111
  • Few thoughts off the bat: **(1)** Rescoring _only_ applies to the top `window_size` results - are you sure this is acceptable for your use case? It SOUNDS like you're trying to modify relevance based on presence of other fields, so I'd think you'd want to do that across the entire search space instead of just the top results from your original scoring. **(2)** I don't think you need a script here, as you should just be able to use a list of `filter` functions instead of `script_score` functions that apply a static boost if documents match some criteria. – rusnyder Jul 10 '19 at 14:37
  • Hi @rusnyder - yes we are intentionally only rescoring the top 100 results. And yes, we are trying to modify (boost) the relevance score based on presence of other fields. However, we place the MOST amount of weight on the `name` field: we want to bring back the most relevant `name` matches via the base query, then use the rescore query to check those results for additional fields. FYI, we first tried to solve this using `function_score` and `doc_score` only and using the `weight` parameter. The problem with that is that if the `DOB` did NOT match, it REDUCED the score. We don't want this. – Stpete111 Jul 10 '19 at 15:17
  • Thanks for clarifying about rescoring, and interesting note regarding your previous attempts. While I'm not sure what you mean by using `doc_score` (unable to find that documented), I do think I have a solution that doesn't require scripting and gets your desired behavior. Effectively, you can use a bool query for your `function_score` query that `should` all your secondary criteria together, then use individual `weight` functions for each criterion to set how much to add to the score for matches. I'll share a complete answer – rusnyder Jul 10 '19 at 16:08
  • Ah, I believe the `doc_score` is proprietary to the name-matching plugin we are using. It's not a well-documented plugin hence your inability to find anything about it. It is probably irrelevant to our discussion in any case. I look forward to your solution. If the `weight` functions do not also REDUCE the score if the additional field doesn't match, then it will work for me. Any tinkering I did with `weight` also reduced the score when the field did not match, which we don't want - we want to boost only. Thanks again. – Stpete111 Jul 10 '19 at 16:20

2 Answers

Splitting to a separate answer as this solves the problem differently (i.e. - by using script_score as OP proposed instead of trying to rewrite away from scripts).

Assuming the same mapping and data as in my other answer (the one that avoids scripting), a scripted version of the query might look like the following:

POST /employee/_search
{
  "size": 100,
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "Name": "John"
          }
        },
        {
          "match": {
            "Name": "Will"
          }
        }
      ]
    }
  },
  "rescore": {
    "window_size": 100,
    "query": {
      "rescore_query": {
        "function_score": {
          "query": {
            "bool": {
              "should": [
                {
                  "match": {
                    "Name": "John"
                  }
                },
                {
                  "match": {
                    "Name": "Will"
                  }
                }
              ]
            }
          },
          "functions": [
            {
              "script_score": {
                "script": {
                  "source": "double boost = 0.0; if (params['_source']['State'] == 'FL') { boost += 0.1; } if (params['_source']['DOB'] == '1965-05-24') { boost += 0.3; } return boost;",
                  "lang": "painless"
                }
              }
            }
          ],
          "score_mode": "sum",
          "boost_mode": "sum"
        }
      },
      "query_weight": 0,
      "rescore_query_weight": 1
    }
  }
}

Two notes about the script:

  1. The script uses params['_source'][field_name] to access the document, which is the only way to get access to text fields. This is significantly slower, as it requires loading each document's source from disk, though the penalty might not be too bad in the context of a rescore. You could instead use doc[field_name].value if the field were an aggregatable type, such as keyword, date, or something numeric.
  2. DOB here is compared directly to a string. This is possible because we're using the _source field, and the JSON for the documents has the dates specified as strings. This is somewhat brittle, since any difference in date formatting makes the comparison fail, but it will likely do the trick.
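
As a hedged variation on the script above, the values being compared can be passed in through params rather than hard-coded into the source, which is how the question's original query supplied query_value. The sketch below shows only the functions portion of the function_score query. The parameter names state and dob are illustrative assumptions, not anything Elasticsearch requires, and the _source access carries the same caveats as in note 1.

```json
{
  "functions": [
    {
      "script_score": {
        "script": {
          "lang": "painless",
          "source": "double boost = 0.0; if (params['_source']['State'] == params.state) { boost += 0.1; } if (params['_source']['DOB'] == params.dob) { boost += 0.3; } return boost;",
          "params": {
            "state": "FL",
            "dob": "1965-05-24"
          }
        }
      }
    }
  ],
  "score_mode": "sum",
  "boost_mode": "sum"
}
```

Keeping the source string constant and varying only params lets Elasticsearch reuse the compiled script across requests instead of compiling a new one for every distinct date.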
rusnyder
  • Thanks again for your efforts! I have attempted to run your query and I'm getting a script exception error. It looks like something is up with the syntax. You can see the full error here: https://ibb.co/frMMhKF – Stpete111 Jul 10 '19 at 22:38
  • What version of ES are you running? (Can’t believe I didn’t ask this sooner!) – rusnyder Jul 10 '19 at 23:00
  • Version 6.4.2 . – Stpete111 Jul 10 '19 at 23:44
  • Ruh roh. So apparently this problem is exclusive to ES 6.4.x: https://discuss.elastic.co/t/can-not-access-to-params-source-in-script-score/153851. They refactored the script context and inadvertently removed the ability to access `_source` from scripts in 6.4.0. I tested in ES 6.4.2 vs. ES 6.5.2, and while it's broken in 6.4.2 it's been fixed in ES 6.5.2. This means that your options are (1) upgrade ES (2) use only `doc['State'].value`-type access in your script (which may require reindexing as `keyword`, unless fields like `State.keyword` exist already) – rusnyder Jul 11 '19 at 00:21
  • Aww crap. Well I'm glad it was that easy to identify why it's not working. Our plan is to upgrade to 7.2 as soon as possible, but we are waiting for a version of the plugin that is compatible with 7.2 which will be another month or so. I think we are going to have to reindex (I received an error about field type when trying to use `doc.value` so definitely need to reindex). Thank you so much for your help, let's leave this open and I will be back as soon as we have a chance to reindex and I can test it again. – Stpete111 Jul 11 '19 at 13:37
  • It seems I also have the option of setting a parameter of `fielddata=true` on the `DOB` field to be able to use `doc.value`. It warns that it "can use significant memory," but I would assume that it bases that on the assumption that `text` fields can contain a lot of data. Our `DOB` field does not. – Stpete111 Jul 11 '19 at 13:42
  • Oh yeah, missed one! Unfortunately, I'm not sure that will work here. When you access fields in a script via `doc`, it's going to give you back the tokens extracted from that field. This is fine for `keyword` fields, since there's only one token, but it's not as simple for `text` fields, which may return multiple tokens. For example, if we have a `DOB` field of `"1965-05-24"`, the default analyzer extracts 3 tokens: `["1965", "05", "24"]`. In `doc['DOB'].values`, their order is non-deterministic. Further, a date like `1970-01-01` only extracts `["1970", "01"]`, because the repeated token is deduplicated. – rusnyder Jul 11 '19 at 14:09
  • You can use the script source `Debug.explain(doc['DOB'].values)` to explore what that returned data will look like. IFF you can find a way to get whatever search you need done w/ an arbitrarily-sorted array of tokens, then your only other concern is `fielddata` memory consumption, which depends most greatly on the total cardinality of extracted tokens per field. Something like DOB will have VERY low cardinality (only 100 possible 2-digit numbers, or 110 if you allow omission of leading 0), so I wouldn't be too concerned. – rusnyder Jul 11 '19 at 14:13
  • Ok great to know, thank you. So (and I just started to google this, but I'll bet your answer will be more concise and make much more sense to me) other than the tokenization differences you explained above, what other differences are there between `keyword` and `text`? Is the main difference just that `keyword` is treated as a phrase and `text` is broken out to separate words? Any other differences I should consider before re-indexing with `DOB` (and several other fields as well) changed to `keyword` from `text`? – Stpete111 Jul 11 '19 at 15:00
  • Frankly, if you're already committing to a reindex, the ES docs do a better job than me, are more complete, and having familiarity w/ ALL types is _essential_: https://www.elastic.co/guide/en/elasticsearch/reference/7.2/mapping-types.html. General rules of thumb: (1) Always consider how you'll use/query fields when picking data type, (2) Pick the most appropriate type per field, (3) Limit data storage (can conflict w/ #1 and #2), and (4) Test your mapping w/ sample data and _real_ queries, then iterate, _before_ committing to reindexing. – rusnyder Jul 11 '19 at 15:36
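
The doc-values option raised in the comments above can be sketched as follows, for the case where upgrading past 6.4.x is not yet possible. It assumes State is mapped as keyword and that DOB has been reindexed with a keyword sub-field, here hypothetically named DOB.keyword. Neither that sub-field name nor the params names state and dob come from the thread's mapping. Only the script_score function is shown, and it is untested against the name-matching plugin.

```json
{
  "script_score": {
    "script": {
      "lang": "painless",
      "source": "double boost = 0.0; if (doc['State'].size() != 0 && doc['State'].value == params.state) { boost += 0.1; } if (doc['DOB.keyword'].size() != 0 && doc['DOB.keyword'].value == params.dob) { boost += 0.3; } return boost;",
      "params": {
        "state": "FL",
        "dob": "1965-05-24"
      }
    }
  }
}
```

The size() checks skip documents that have no value for a field, since reading .value on a field with no values does not behave the same way across versions. Because a keyword field keeps the whole string as a single token, the comparison remains an exact string match.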

Assuming static weights per additional field, you can accomplish this without using scripting (though you may need to use script_score for any more complex weighting). To solve your issue of directly adding to a document's original score, your rescoring query will need to be a function score query that:

  1. Composes queries for additional fields in a should clause for the function score's main query (i.e. - will only produce scores for documents matching at least one additional field)
  2. Uses one function per additional field, with the filter set to select documents with some value for that field, and a weight to specify how much the score should increase (or some other scoring function if desired)

Mapping (as template)

Adding a State and DOB field for sake of example (making sure multiple additional fields contribute to the score correctly)

PUT _template/employee_template
{
  "index_patterns": ["employee"],
  "settings": {
    "number_of_shards": 1
  },
  "mappings": {
    "_doc": {
      "properties": {
        "Name": {
          "type": "text"
        },
        "State": {
          "type": "keyword"
        },
        "DOB": {
          "type": "date"
        }
      }
    }
  }
}

Sample data

POST /employee/_doc/_bulk
{"index":{}}
{"Name": "John Smith", "State": "NY", "DOB": "1970-01-01"}
{"index":{}}
{"Name": "John C. Reilly", "State": "CA", "DOB": "1965-05-24"}
{"index":{}}
{"Name": "Will Ferrell", "State": "FL", "DOB": "1967-07-16"}

Query

EDIT: Updated the query to include the original query in the new function score in an attempt to compensate for custom scoring plugins.

A few notes about the query below:

  • Setting the rescorer's score_mode: max is effectively a replace here, since the newly computed function score should always be greater than or equal to the original score
  • query_weight and rescore_query_weight are both set to 1 such that they are compared on equal scales during score_mode: max comparison
  • In the function_score query:
    • score_mode: sum will add together all the scores from functions
    • boost_mode: sum will add the sum of the functions to the score of the query
POST /employee/_search
{
  "size": 100,
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "Name": "John"
          }
        },
        {
          "match": {
            "Name": "Will"
          }
        }
      ]
    }
  },
  "rescore": {
    "window_size": 100,
    "query": {
      "rescore_query": {
        "function_score": {
          "query": {
            "bool": {
              "should": [
                {
                  "match": {
                    "Name": "John"
                  }
                },
                {
                  "match": {
                    "Name": "Will"
                  }
                }
              ],
              "filter": {
                "bool": {
                  "should": [
                    {
                      "term": {
                        "State": "CA"
                      }
                    },
                    {
                      "range": {
                        "DOB": {
                          "lte": "1968-01-01"
                        }
                      }
                    }
                  ]
                }
              }
            }
          },
          "functions": [
            {
              "filter": {
                "term": {
                  "State": "CA"
                }
              },
              "weight": 0.1
            },
            {
              "filter": {
                "range": {
                  "DOB": {
                    "lte": "1968-01-01"
                  }
                }
              },
              "weight": 0.3
            }
          ],
          "score_mode": "sum",
          "boost_mode": "sum"
        }
      },
      "score_mode": "max",
      "query_weight": 1,
      "rescore_query_weight": 1
    }
  }
}
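
Since the question's DOB field is text and only an exact string match is needed, one possible adaptation of the functions above swaps the range filter for a match_phrase filter. This is an untested sketch: it assumes the field uses the default standard analyzer, reuses the question's sample value 01-01-1999, and shows only the function_score pieces that change. Whether it cooperates with the name-matching plugin is unknown.

```json
{
  "functions": [
    {
      "filter": {
        "match_phrase": {
          "DOB": "01-01-1999"
        }
      },
      "weight": 0.1
    }
  ],
  "score_mode": "sum",
  "boost_mode": "sum"
}
```

A phrase match requires the analyzed tokens to appear in the same order, so it avoids treating 01-02-1999 and 02-01-1999 as equal, though it would also match a longer field value that merely contains the phrase. The same match_phrase clause would replace the range clause inside the query's filter.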
rusnyder
  • I forgot to mention one very important point. The `DOB` field is currently a `text` field. We have so many different variations, some not even good date formats, that we had to make it a `text` field for now. How would this change your answer, if at all? – Stpete111 Jul 10 '19 at 16:47
  • Wow this is incredibly well-written and detailed answer. I can't wait to see if this works! – Stpete111 Jul 10 '19 at 16:50
  • Hi @rusnyder while I wait to hear if you think we need to revise any part of your query based on the fact that the `DOB` field is of type `text`, I will tell you that I have run your query as-is against my index. The document returned on top is the document I expect (both name and DOB match exactly) but the score is being returned as 107.014 as opposed to what I would expect it to be - 1.03. I do know that when `query_weight` is anything other than zero while using this plugin, it is allowing the base query score (TF/IDF) to be part of the final score calculation, which we don't want. We... – Stpete111 Jul 10 '19 at 17:06
  • ...only want the score provided by the plugin (plus our additional field scores) to be the calculated score. I am wondering also if I need that `doc_score` parameter for this to work correctly, as I believe that tells the plugin to consider all of the following fields when coming up with the score. Maybe I'm overthinking it. – Stpete111 Jul 10 '19 at 17:07
  • Ok, if I change `boost_mode` to `sum`, and `query_weight` to `0`, I so far get the score I would expect. I will keep experimenting. – Stpete111 Jul 10 '19 at 17:19
  • Sorry, in my third comment above, the expected score should say 1.3, not 1.03. – Stpete111 Jul 10 '19 at 17:20
  • Ok, no the scoring definitely isn't working as expected. With my above-mentioned changes, I have several of the top documents being scored at 1.3, and then the rest being scored at zero. I think without a full understanding of how this name-matching plugin works, it may be difficult to assist me. I do know that the scripting is definitely supported in the plugin, using the syntax I provided in my original post. Perhaps this is why the developers of the plugin suggest this approach. – Stpete111 Jul 10 '19 at 17:24
  • I see, and sorry for submitting an answer that didn't work! I've updated the query to now include the original `Name` query as part of the function score, which I'm _hoping_ plays more nicely with the custom scoring plugin. If that doesn't work, I'll craft up a new answer to assist with a scoring script. – rusnyder Jul 10 '19 at 19:53
  • Regarding the `DOB` field being indexed as `text`, I'd first counter with: Do yourself a favor (if possible!) and index it as a `date` instead! If that's not possible, then changes to my query would depend on what I was trying to accomplish. If I _really_ needed to do date math on a `text` field, the only option is running `scripts` on the document `_source`, which is _generally_ a really bad idea, but probably not terrible in a rescorer that is only running on hundreds of docs. In a script, I'd parse the date as a `LocalDateTime` and go from there. – rusnyder Jul 10 '19 at 19:56
  • Thanks for the comments and all your help so far! I will try your new query shortly. Regarding `DOB`, yes, we do plan to reindex all of our data with the `DOB` field as type `date`. The issue right now is we have to clean our date data first. In the meantime, for simplicity's sake, let's just assume that the date passed in the query is a string, the date in the document is a string, and the two strings must match exactly to get the score boost. So in other words, we could probably use something other than DOB for our example if we wanted to just to demonstrate that the syntax is working. – Stpete111 Jul 10 '19 at 20:41
  • Hi again, unfortunately the revised query in your answer still isn't working. With the `query_weight` set to 1, the score returned on the document I am testing with is 106.14. If it's set to zero, the top document is scored at 0.3. We may need to look at the script approach as maybe this is all the plugin supports. The syntax I have in my original post comes from the plugin documentation, so I think it's right. – Stpete111 Jul 10 '19 at 21:02
  • FYI, the documentation also states that in all cases, for the name-matching score to be properly honored in the final results, `query_weight` should always be zero, and `rescore_query_weight` should always be 1. – Stpete111 Jul 10 '19 at 21:08