Elasticsearch: How to score query for range based on array with max/min values

Question

I have many documents containing a rate property which is an array containing min/max range of accepted rates.

{ "rate": [250, 700] }

I now would like to perform queries providing another range, for example:

{
  "bool": {
    "must": [
      {
        "range": {
          "rate": { "from": 100, "to": 500 }
        }
      }
    ]
  }
}

That works fine: it returns every document that has at least one value inside the provided range, which is what I want.

However, for all results, the score is the same. It doesn't matter if the value is the same as on the document or it just hits the range for a few numbers. As shown below:

{
  "_id": "one",
  "_score": 1,
  "_source": { "rate": [250,750] }
},
{
  "_id": "two",
  "_score": 1,
  "_source": { "rate": [200,350] }
},
{
  "_id": "three",
  "_score": 1,
  "_source": { "rate": [500,750] }
}

Is there any way to improve a range search providing another range like this?

score 0 · Answer 1 · answered Jul 24 '16 at 00:45

You're asking for a range, which is implicitly a yes-or-no question. It's actually weird to score against it at all, other than as a booster (as in: if it has it, then boost the score, but if it doesn't have it, then that's okay). As such, range queries tend to be best used in the filter context.

"query": {
  "bool": {
    "filter": [
      {
        "range": {
          "rate": { "gte": 100, "lte": 500 }
        }
      }
    ]
  }
}

(Syntax assumes ES 2.0)

That doesn't really help you, but it is the better way to do the request that you are doing.

As for what you are asking, you want to weight based on the raw value(s) in the document. This is much less straightforward because the value is an array whose values can potentially be out of bounds, and it's not a nested object, so it's always treated as an array (meaning you'll need to manually re-exclude the ignored results).

Completely custom scoring requires scripts (native or otherwise), and this can easily be accomplished with a script score.

It doesn't matter if the value is the same as on the document or it just hits the range for a few numbers.

I don't actually understand what the first part means: do you want a single match to "weigh" less or more? Does the distance from the edges matter? Does just matching matter?

I will assume that more matches are better, regardless of where they fall in the range:

{
  "query": {
    "bool": {
      "must": {
        "function_score": {
          "functions": [
            {
              "script_score": {
                "script": {
                  "inline": "doc['rate'].values.findAll { it >= gte && it <= lte }.size()",
                  "lang": "groovy",
                  "params": {
                    "gte": 100,
                    "lte": 500
                  }
                }
              }
            }
          ],
          "boost_mode": "replace"
        }
      },
      "filter": [
        {
          "range": {
            "rate": {
              "gte": 100,
              "lte": 500
            }
          }
        }
      ]
    }
  }
}

You should not be using inline Groovy scripts in production (use file-based scripts instead), but the above will work.
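
To make the effect of that script score concrete, here is a minimal Python sketch of the same counting logic, applied to the three sample documents from the question. The function name `count_in_range` is made up for illustration; it only mirrors the Groovy expression and is not part of any Elasticsearch API.

```python
def count_in_range(rates, gte, lte):
    """Count how many values fall inside the inclusive range [gte, lte]."""
    return sum(1 for rate in rates if gte <= rate <= lte)


# The three sample documents from the question, keyed by _id.
docs = {"one": [250, 750], "two": [200, 350], "three": [500, 750]}

# Score each document against the query range 100..500.
scores = {doc_id: count_in_range(rates, 100, 500) for doc_id, rates in docs.items()}
print(scores)  # {'one': 1, 'two': 2, 'three': 1}
```

Under this assumption, document "two" ranks first because both of its values fall inside the range, while "one" and "three" tie with a single match each.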

Thanks, @pickypg, I had actually forgotten to add the range query when I was asking the question (fixed now). Unfortunately, I don't have access to Groovy as I am using the AWS ES service. Can you think of any other idea that could make it work? I could model the data differently if that helps, but I found that having separate values like `rate_from` and `rate_to` got even more complicated. The idea is that the more a document matches an area within the range of rates, the higher its score would be (i.e., when providing 100,500, a 100,500 input would be a 100% match, while 400,700 not so much). — zanona, Jul 24 '16 at 08:39
Why not use `Groovy`? Is there an equivalent solution for `Painless`? — Ami Hollander, Apr 04 '18 at 10:50
Groovy had a lot of security issues with sandboxing. By allowing inline scripts, you open up your cluster to a lot of issues. Additionally, they're compiled on-the-fly (and cached) which is wasteful if a script is never reused (using script `params` enables reuse much like Stored Procedures in SQL databases). Painless borrows some features from Groovy, but it's mostly a subset of Java. So if you can do it in Java, you can do it in Painless. This can definitely be done using Java Streams. This assumes you get the stream, but: `stream().filter(rate -> rate >= gte && rate <= lte).count();`. — pickypg, Apr 09 '18 at 20:45
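
For the "how much of the range matches" idea raised in the comments, one possible formula is the overlap between the document's range and the query range, divided by their combined span. The Python sketch below shows the arithmetic only and is not taken from the answer: the name `overlap_score` and the choice to normalise by the combined span are assumptions, and the same arithmetic would still have to be ported to whatever scripting language the cluster allows.

```python
def overlap_score(doc_min, doc_max, query_min, query_max):
    """Return the shared length of two ranges divided by their combined span.

    1.0 means the ranges are identical; 0.0 means they share no length
    (this includes ranges that only touch at a single point).
    """
    overlap = min(doc_max, query_max) - max(doc_min, query_min)
    if overlap <= 0:
        return 0.0
    # overlap > 0 guarantees span > 0, so the division is safe.
    span = max(doc_max, query_max) - min(doc_min, query_min)
    return overlap / span


print(overlap_score(100, 500, 100, 500))  # identical ranges: 1.0
print(overlap_score(400, 700, 100, 500))  # partial overlap: 100 / 600
print(overlap_score(500, 750, 100, 500))  # touches only at 500: 0.0
```

Note that a document such as [500, 750] still passes the range filter for 100..500 but scores 0.0 here, so whether edge-only matches should rank last or be excluded is a separate decision.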
