Partitioning aggregates with groups

Question

I'm trying to partition an aggregation, similar to the example in the Elasticsearch documentation, but I can't get the example to work.

The index is populated with event-types:

public class Event
{
    public int EventId { get; set; }
    public string SegmentId { get; set; }
    public DateTime Timestamp { get; set; }
}

The EventId is unique, and each event belongs to a specific SegmentId. Each SegmentId can be associated with zero to many events.

The question is: How do I get the latest EventId for each SegmentId?

I expect the number of unique segments to be around 10 million, and the number of unique events to be one or two orders of magnitude greater. That's why I don't think using top_hits by itself is appropriate, as suggested here. Hence, partitioning.

Example:

I have set up a demo-index populated with 1313 documents (unique EventId), belonging to 101 distinct SegmentId (i.e. 13 events per segment). I would expect the query below to work, but the exact same results are returned regardless of which partition number I specify.

POST /demo/_search
{
  "size": 0,
  "aggs": {
    "segments": {
      "terms": {
        "field": "segmentId",
        "size": 15,                  <-- I want 15 segments from each query
        "include": {
          "partition": 0,            <-- Trying to retrieve the first partition
          "num_partitions": 7        <-- Expecting 7 partitions (7*15 > 101 segments)
        }
      },
      "aggs": {
        "latest": {
          "top_hits": {
            "size": 1,
            "_source": [
              "timestamp",
              "eventId",
              "segmentId"
            ],
            "sort": {
              "timestamp": "desc"
            }
          }
        }
      }
    }
  }
}

If I remove the include and set size to a value greater than 101, I get the latest event for every segment. However, I doubt that is a good approach with a million buckets...

score 1 · Answer 1 · answered Apr 12 '17 at 14:03

You are effectively trying to scroll through an aggregation.

The Scroll API is supported only for search queries, not for aggregations. If you do not want to use top_hits alone, as you have stated, because of the huge number of documents, you can try either of the following:

Parent/child approach: create the segments as parent documents and the events as child documents. Every time you add a child, update a timestamp field in the parent document. You can then query just the parent documents to get each segment id together with its latest event timestamp.
Time-window approach: get the top hits only for the last 24 hours. Add a query that first filters to the last 24 hours, then run the aggregation with top_hits on that subset.
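A minimal sketch of the second option, written as a Python function that only builds the request body. The function name `recent_latest_body`, the `window` and `size` defaults, and the choice of a `bool` filter are assumptions for illustration; the field names come from the question. Note that a segment with no event inside the window will not show up in the result at all.

```python
from typing import Any, Dict


def recent_latest_body(field: str = "segmentId",
                       window: str = "now-24h",
                       size: int = 100) -> Dict[str, Any]:
    """Build a search body: filter to a recent time window, then take
    the newest event per segment with terms + top_hits.

    Hypothetical helper; `window` uses Elasticsearch date math.
    """
    return {
        "size": 0,  # no top-level hits, only aggregations
        "query": {
            "bool": {
                # filter context: no scoring, just restricts the documents
                "filter": [{"range": {"timestamp": {"gte": window}}}]
            }
        },
        "aggs": {
            "segments": {
                "terms": {"field": field, "size": size},
                "aggs": {
                    "latest": {
                        "top_hits": {
                            "size": 1,
                            "_source": ["timestamp", "eventId", "segmentId"],
                            "sort": {"timestamp": "desc"},
                        }
                    }
                },
            }
        },
    }
```

The returned dictionary can be serialized as JSON and sent to POST /demo/_search in the same way as the query in the question.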

You are correct in that what I wanted was a scroll on an aggregation, which is not supported. However, I solved it with partitioning (see my accepted answer). Thank you for your suggestions, though! They might come in handy in another situation! (: — Reyhn, Apr 26 '17 at 12:37

score 1 · Accepted Answer · answered Apr 26 '17 at 12:34


It turns out I was investigating the wrong question... My example actually works perfectly.

The problem was my local Elasticsearch node. I don't know what went wrong with it, but when I repeated the example on another machine, it worked. I was, however, unable to get partitioning working on my existing ES installation, so I uninstalled and reinstalled Elasticsearch, and then the example worked there too.

To answer my original question, the example I provided is the way to go. I solved my problem by using the cardinality aggregation to get an estimate of the total number of products, from which I derived a suitable number of partitions. Then I ran the query above once for each partition and added the resulting documents to a final list.
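The loop can be sketched roughly as follows, in Python rather than my actual code. Everything named here is an assumption for illustration: `search` is a hypothetical callable that takes a request body and returns the parsed JSON response (for example a thin wrapper around POST /demo/_search), and `per_partition` and `headroom` are made-up tuning knobs. The headroom is there because, as far as I understand, terms are assigned to partitions by hashing, so partitions are not guaranteed to be equally sized, and a `size` that is too tight can silently drop buckets.

```python
import math
from typing import Any, Callable, Dict, List

# Hypothetical transport: request body in, parsed response out.
Search = Callable[[Dict[str, Any]], Dict[str, Any]]


def cardinality_body(field: str) -> Dict[str, Any]:
    """Request body estimating the number of distinct values of `field`."""
    return {"size": 0,
            "aggs": {"segment_count": {"cardinality": {"field": field}}}}


def partition_body(field: str, partition: int,
                   num_partitions: int, size: int) -> Dict[str, Any]:
    """The query from the question, parameterized by partition."""
    return {
        "size": 0,
        "aggs": {"segments": {
            "terms": {"field": field, "size": size,
                      "include": {"partition": partition,
                                  "num_partitions": num_partitions}},
            "aggs": {"latest": {"top_hits": {
                "size": 1,
                "_source": ["timestamp", "eventId", "segmentId"],
                "sort": {"timestamp": "desc"}}}},
        }},
    }


def latest_event_per_segment(search: Search, field: str = "segmentId",
                             per_partition: int = 1000,
                             headroom: float = 2.0) -> List[Dict[str, Any]]:
    """Collect the newest event of every segment, one partition at a time."""
    # 1. Estimate how many distinct segments exist (cardinality is approximate).
    estimate = search(cardinality_body(field))["aggregations"]["segment_count"]["value"]
    # 2. Derive the partition count; ask for more buckets than the average
    #    partition should hold, since partitions may be uneven.
    num_partitions = max(1, math.ceil(estimate / per_partition))
    size = math.ceil(per_partition * headroom)
    # 3. Run the same aggregation once per partition and merge the results.
    latest: List[Dict[str, Any]] = []
    for partition in range(num_partitions):
        response = search(partition_body(field, partition, num_partitions, size))
        for bucket in response["aggregations"]["segments"]["buckets"]:
            latest.append(bucket["latest"]["hits"]["hits"][0]["_source"])
    return latest
```

If any partition returns exactly `size` buckets, it is worth treating that as a sign that buckets may have been cut off and rerunning with more partitions or more headroom.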

Reyhn


Hi Reyhn, I am trying to do the same without sorting, and it seems that on small data sets (tests) I get missing aggregations. I am using version 6.5. Is the include/exclude API deterministic when used with partitions? – Ehud Lev Jan 20 '19 at 18:00
I just discovered this pagination option on aggregations recently, but have been wondering how it will perform. I plan to use both as 'nested' 'terms' aggregations instead of top_hits on millions of documents. – User3518958 Apr 14 '21 at 05:15
