3

Rank beginner at ElasticSearch here.

I have a list of customers, and their orders as a nested field. Assuming a document structure like:

[
  { customerId: 123,
    birthday: 1980-01-01,
    orders: [
      {
        orderValue: 1500,
        orderDate: 2018-12-18T12:18:12Z
      },
      [...]
    ]
  },
  [...]
}

What I'd like to query is: The list of users who ordered for a certain amount from between two dates. And I'd like to be able to combine that with a range query for, for example, birthday.

I've gotten to the point where I can get the sum ordered between two dates per subscriber using aggregations:

{
  "size": 0,
  "aggs": {
    "foo": {
      "nested": {
        "path": "orders"
      },
      "aggs": {
        "grouped_by_customerId": {
          "terms": {
            "field": "orders.customerId.keyword"
          },
          "aggs": {
            "filtered_by_date": {
              "filter": {
                "range": {
                  "orders.orderDate": {
                    "from": "2018-01-28",
                    "to": null,
                    "include_lower": false,
                    "include_upper": true,
                    "format": "yyyy-MM-dd",
                    "boost": 1
                  }
                }
              },
              "aggs": {
                "sum": {
                  "sum": {
                    "field": "orders.orderValue"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

However, I'd like to limit the results I get back in the Query part, to mix better with all our other filters.

My first thought was to have a script filter and pass the bounding dates and minimum value in as parameters, but then I'd have to iterate over a doc's nested documents, and that doesn't seem to work.

Is that last idea possible, and if so, how?

Thanks!

Bones
  • 71
  • 8
  • I would clearly move the filter you have in your aggregation filter in the query part – Val Dec 23 '18 at 07:46
  • Right, that part works fine. However, that is only part of the equation, as I also want to sum the orderValue's within that timeframe and compare them to a parameter. If I could loop over the orders-field (and check the from- and to-dates there as well, obviously) and compare it with my threshold, that would work, but that doesn't seem to be possible... – Bones Dec 23 '18 at 09:33
  • So right now, the aggregation you have brings you the correct sum of `orders.orderValue`, right? and now you just need to compare that sum with some parameter? – Val Dec 23 '18 at 09:34
  • Correct. And preferably in the query, so it can mix-and-match with other selection criteria. I could do it in code, but I want the entire object (which could be quite large, and so could the result set) – Bones Dec 23 '18 at 09:51
  • Watch out for putting too many nested documents under a parent document. Whenever you add or modify a nested document the whole thing (parent and all it's nested documents) gets reindexed. This can cause a performance hit when the number of orders for a given user gets very high. – Radix Salvilines Jun 26 '19 at 13:36

1 Answers1

4

Finally solved this myself, using a Function Score query as follows:

{
  "query": {
    "bool": {
      "must": [
        {
          "function_score": {
            "min_score": 1,
            "query": {
              "nested": {
                "path": "orders",
                "ignore_unmapped": false,
                "score_mode": "min",
                "boost": 1,
                "query": {
                  "range": {
                    "orders.orderDate": {
                      "from": "2018-12-10",
                      "to": null,
                      "include_lower": true,
                      "include_upper": true,
                      "format": "yyyy-MM-dd",
                      "boost": 1
                    }
                  }
                }
              }
            },
            "functions": [
              {
                "filter": {
                  "match_all": {}
                },
                "script_score": {
                  "script": {
                    "source": "ArrayList x = params['_source']['orders'];if (x == null) { return 0 }long result = x.stream().filter(order -> {  if(params.startDate != null && !ZonedDateTime.parse(order.orderDate).isAfter(ZonedDateTime.parse(params.startDate))) return false; return true}).mapToLong(order->Long.parseLong(order.orderValue)).sum();if(params.operator == 'GT') return result > params.totalOrderValue ? 2 : 0;else if (params.operator == 'GE') return result >= params.totalOrderValue ? 3 : 0;else if (params.operator == 'LE') return result <= params.totalOrderValue ? 4 : 0;else if(params.operator == 'LT') return result < params.totalOrderValue ? 5 : 0;return result == params.totalOrderValue ? 6 : 0",
                    "lang": "painless",
                    "params": {
                      "totalOrderValue": 120,
                      "operator": "GE",
                      "startDate": "2012-12-10T23:00:00.000Z"
                    }
                  }
                }
              }
            ],
            "score_mode": "multiply",
            "max_boost": 3.4028235e+38,
            "boost": 1
          }
        }
      ],
      "adjust_pure_negative": true,
      "boost": 1
    }
  }
}

The actual scores this produces are debugging output to test the operator, but a min_score of 1 means any of them match. Using _source is quite slow.

Without a query in the function_score it works, but takes 20 seconds or so to muddle through 3 million records. With the query, you only look at customers with orders that actually match the date range.

Since the painless script processes the entire list of orders, it has to redo the date math. Some optimization to do there, but at least I have a Proof of Concept.

I've seen this question before without a satisfactory answer, so hopefully someone finds this useful.

Bones
  • 71
  • 8
  • nice, thank a lot. I struggled with an equivalent problem. I had to filter matching document on a sum of filtered nested document value criteria (-_-!!), and since _source is not available in filter context, i had no way to do it. Good that the _source is available in score context ! – Pierre Mallet Jun 14 '19 at 08:21
  • @bones which version of ES are you using? params._source used to work in ES 5.6 but no longer works in 6.8 :( – nattfodd Oct 06 '20 at 14:17
  • The regular "boost_mode" multiples the score of the query with the score returned by the function, you might want to use "boost_mode": "replace" to use only the function score https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html#:~:text=is%20multiplied%20(default)-,replace,-only%20function%20score – Ron Serruya Sep 09 '21 at 08:48