
I need to compute a pipeline aggregation in ElasticSearch and I can't figure out how to express it.

Each document has an email address and an amount. I need to output range buckets over the total amount received per unique email, counting how many emails fall into each range.

{ "0 - 99": 300, "100 - 400": 100 ...}

That would basically be the expected output (the keys would be transformed in my application code), indicating that 300 unique emails have each cumulatively received at most 99 (amount) across all documents.

Intuitively, I would expect a query like the one below. However, range is not a pipeline aggregation, so it does not accept a buckets_path.

What is the correct approach here?

{
 aggs: {
   users: {
     terms: {
       field: "email"
     },
     aggs: {
       amount_received: {
         sum: {
           field: "amount"
         }
       }
     }
   },
   amount_ranges: {
     range: {
       buckets_path: "users>amount_received",
       ranges: [
           { to: 99.0 },
           { from: 100.0, to: 299.0 },
           { from: 300.0, to: 599.0 },
           { from: 600.0 }
       ]
     }
   }
 }
}
Ben

1 Answer


There's no pipeline aggregation that does that directly. However, I think I came up with a solution that should suit your needs. The idea is to repeat the same terms/sum aggregation once per range and add a bucket_selector pipeline aggregation that keeps only the buckets falling into that range.

POST index/_search
{
  "size": 0,
  "aggs": {
    "users_99": {
      "terms": {
        "field": "email",
        "size": 1000
      },
      "aggs": {
        "amount_received": {
          "sum": {
            "field": "amount"
          }
        },
        "-99": {
          "bucket_selector": {
            "buckets_path": {
              "amountReceived": "amount_received"
            },
            "script": "params.amountReceived < 100"
          }
        }
      }
    },
    "users_100_299": {
      "terms": {
        "field": "email",
        "size": 1000
      },
      "aggs": {
        "amount_received": {
          "sum": {
            "field": "amount"
          }
        },
        "100-299": {
          "bucket_selector": {
            "buckets_path": {
              "amountReceived": "amount_received"
            },
            "script": "params.amountReceived >= 100 && params.amountReceived < 300"
          }
        }
      }
    },
    "users_300_599": {
      "terms": {
        "field": "email",
        "size": 1000
      },
      "aggs": {
        "amount_received": {
          "sum": {
            "field": "amount"
          }
        },
        "300-599": {
          "bucket_selector": {
            "buckets_path": {
              "amountReceived": "amount_received"
            },
            "script": "params.amountReceived >= 300 && params.amountReceived < 600"
          }
        }
      }
    },
    "users_600": {
      "terms": {
        "field": "email",
        "size": 1000
      },
      "aggs": {
        "amount_received": {
          "sum": {
            "field": "amount"
          }
        },
        "600": {
          "bucket_selector": {
            "buckets_path": {
              "amountReceived": "amount_received"
            },
            "script": "params.amountReceived >= 600"
          }
        }
      }
    }
  }
}

In the results, the number of buckets in users_99 will be the number of unique emails whose total amount is less than 100. Similarly, users_100_299 will contain as many buckets as there are unique emails with a total of at least 100 and less than 300. And so on...
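Since each top-level aggregation returns one terms bucket per matching email, the per-range counts can be read off in application code simply by counting buckets. A minimal sketch, assuming the response has already been parsed into a Python dict (the aggregation names match the query above; everything else is illustrative):

def range_counts(resp):
    """Count unique emails per range from the response of the search above.

    Note: each count is capped by the terms aggregation "size" (1000 in the query above).
    """
    range_aggs = ["users_99", "users_100_299", "users_300_599", "users_600"]
    return {name: len(resp["aggregations"][name]["buckets"]) for name in range_aggs}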

Val
  • Probably the only solution that stays entirely within ES (no steps outside ES). I do have some yet-to-be-proven concerns about the performance of this bundle of aggregations. BUT, however it performs, if you are happy with it, @Ben, then there are no concerns :-). If the performance affects your ES usage, maybe consider doing the "split" outside Elasticsearch. – Andrei Stefan Jun 22 '18 at 08:33
  • Agreed @Andrei, performance might be a concern depending on the amount of data Ben wants to run this query on. We'll see what he says. Besides, it would be nice to create a new `bucket_range` pipeline aggregation; I'll probably file a feature request soon. – Val Jun 22 '18 at 08:35
  • Val, I did consider a solution like this but was hoping there was a more built-in approach. I will definitely give this a try and see if the performance hit is acceptable. Thanks! – Ben Jun 24 '18 at 23:06
  • @Ben were you able to try this out? – Val Jun 28 '18 at 05:45
  • @Val, I was able to try it out. It would be great if I could just derive the document count of each bucket, as opposed to returning an array of docs for each. As you point out, pulling all of the records (beyond the arbitrary 1k size you placed) for each bucket would likely create a performance issue. I suppose deriving that count is ultimately not possible. – Ben Jun 28 '18 at 07:58
  • I'm not sure what document count you want to get back. The aggregations are not pulling out an array of docs, but simply aggregating them. – Val Jun 28 '18 at 08:04
  • What cardinality are you expecting for the top-level buckets? – Val Jun 28 '18 at 14:15
  • When I execute the query, I get back an object which looks like this: {"users_600":{"doc_count_error_upper_bound":5,"sum_other_doc_count":170,"buckets": [{"key":"steve@g.com","doc_count":8,"amount_received":{"value":334.0}}]}} – Ben Jun 28 '18 at 17:48
  • It would be great if, instead of the buckets array of objects, I could simply get back the cardinality (count of unique email addresses which fall into that bucket). – Ben Jun 28 '18 at 17:51
  • I don't think it's possible since the buckets need to be computed anyway in order for the sub "amount_received" aggregation to be performed and the pipeline bucket selector to kick in. What kind of performance do you get? – Val Jun 28 '18 at 19:55
  • My main consideration there is just application memory due to all the data in the buckets array. I think I'll be fine in this case due to a unique aspect of my problem. The vast majority of documents will be in the range of 0 - 200. So, I ended up bucketing all of the higher ranges as you outlined above and then subtracting their counts from the total document count to derive the 0 - 200 count. I do not compute / return the 0 - 200 bucket in ES. – Ben Jun 28 '18 at 22:22
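
For reference, a minimal sketch of the subtraction workaround described in the last comment, assuming only the higher-range aggregations from the answer are requested and that a total to subtract from is available (Ben mentions the total document count; a cardinality aggregation on the email field would be one way to obtain a unique-email total, so the exact source of that total is an assumption here):

def low_range_count(resp, total_count, higher_ranges=("users_300_599", "users_600")):
    # resp: parsed JSON response containing only the higher-range aggregations from the answer.
    # total_count: the total to subtract from (its exact source is an assumption).
    higher = sum(len(resp["aggregations"][name]["buckets"]) for name in higher_ranges)
    return total_count - higher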