How to use bucket, metric or pipeline in elasticsearch dsl (elasticsearch-dsl-py) for multiple layers of aggregations

Question

I have a raw aggregation request like the one below, but I'm having a hard time converting it to elasticsearch-dsl.

I've read the documentation, which says that the .bucket(), .metric() and .pipeline() methods can be used to nest aggregations, but it doesn't illustrate how to combine the three for more complex aggregations with several layers.

{
   "aggs": {
        "statistics": {
             "terms": {
                 "field":"id"
             },
             "aggs":{
                "date":{
                    "date_histogram":{
                        "min_doc_count":0,
                        "field":"date",
                        "interval":"1d",
                        "format":"yyyy-MM-dd"
                    },
                    "aggs":{
                        "column_a":{
                            "avg":{
                                "field":"column_a"
                            }
                        },
                        "column_b":{
                            "avg":{
                                "field":"column_b"
                            }
                        },
                        "column_c":{
                            "avg":{
                                "field":"column_c"
                            }
                        },
                        "a_gap":{
                            "serial_diff":{
                                "buckets_path":"column_a"
                            }
                        },
                        "b_gap":{
                            "serial_diff":{
                                "buckets_path":"column_b"
                            }
                        },
                        "c_gap":{
                            "serial_diff":{
                                "buckets_path":"column_c"
                            }
                        }
                    }
                },
                "sum_a_gap":{
                    "sum_bucket":{
                        "buckets_path":"date>a_gap"
                    }
                },
                "sum_b_gap":{
                    "sum_bucket":{
                        "buckets_path":"date>b_gap"
                    }
                },
                "sum_c_gap":{
                    "sum_bucket":{
                        "buckets_path":"date>c_gap"
                    }
                }
            }
        }
    }
}

My elasticsearch-dsl query below puts 'sum_a_gap' at the same level as 'column_a' and 'a_gap', instead of next to 'date':

(
    self._search.aggs
    .bucket('statistics', 'terms', field='id')
    .bucket('date', 'date_histogram', field='date',
            interval='1d', min_doc_count=0, format='yyyy-MM-dd')
    .metric('column_a', 'avg', field='column_a')
    .metric('column_b', 'avg', field='column_b')
    .metric('column_c', 'avg', field='column_c')
    .pipeline('a_gap', 'serial_diff', buckets_path='column_a')
    .pipeline('b_gap', 'serial_diff', buckets_path='column_b')
    .pipeline('c_gap', 'serial_diff', buckets_path='column_c')
    .pipeline('sum_a_gap', 'sum_bucket', buckets_path='date>a_gap')
    .pipeline('sum_b_gap', 'sum_bucket', buckets_path='date>b_gap')
    .pipeline('sum_c_gap', 'sum_bucket', buckets_path='date>c_gap')
)

Many thanks in advance!

You should start here to learn about the differences between bucket, metric and pipeline aggregations: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html — Val, Dec 02 '19 at 07:20

score 1 · Answer 1 · answered Dec 03 '19 at 02:02

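Eventually, I figured it out: the order of the calls matters. .bucket() returns the newly created bucket, so anything chained after it is nested inside that bucket, whereas .metric() and .pipeline() return the aggregation they were called on. Calling the 'sum_{}_gap' pipelines on 'statistics' before creating the 'date' bucket therefore puts 'date' and 'sum_{}_gap' at the same level under 'statistics', with the other metrics and pipelines under 'date'.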

(
    self._search.aggs
    .bucket('statistics', 'terms', field='id')
    .pipeline('sum_a_gap', 'sum_bucket', buckets_path='date>a_gap')
    .pipeline('sum_b_gap', 'sum_bucket', buckets_path='date>b_gap')
    .pipeline('sum_c_gap', 'sum_bucket', buckets_path='date>c_gap')
    .bucket('date', 'date_histogram', field='date',
            interval='1d', min_doc_count=0, format='yyyy-MM-dd')
    .metric('column_a', 'avg', field='column_a')
    .metric('column_b', 'avg', field='column_b')
    .metric('column_c', 'avg', field='column_c')
    .pipeline('a_gap', 'serial_diff', buckets_path='column_a')
    .pipeline('b_gap', 'serial_diff', buckets_path='column_b')
    .pipeline('c_gap', 'serial_diff', buckets_path='column_c')
)

Thank you so much for this question and self answer. It's like a tutorial in something I couldn't find anywhere (since I was searching for "sub aggregations" instead of "nest aggregations"). — Noumenon, Sep 12 '21 at 15:07
