
I want to achieve the following group-by aggregation using Elasticsearch. The SQL equivalent would be roughly:

SELECT MIN(unix_time) as min_utime, MAX(unix_time) as max_utime
FROM myindex
GROUP BY http_user, src

Would the aggregations be something like this?

"aggs":{
    "suser":{
        "terms":{
            "field":"http_user"
        },
        "aggs":{
            "src":{
                "terms":{
                    "field":"src"
                },
                "aggs":{
                    "max_utime":{
                        "max":{
                            "field":"unix_time"
                        }
                    },
                    "min_utime":{
                        "min":{
                            "field":"unix_time"
                        }
                    }
                }
            }
        }
    }            
}

I am not sure I have understood aggregations correctly. Am I right?
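As a sketch of how I send this aggregation with the official elasticsearch-py client (the client setup and the index name `myindex` are assumptions; `size: 0` just suppresses the hits since I only need the buckets):

```python
# Sketch: run the group-by aggregation with elasticsearch-py.
# Index name "myindex" and the es_client setup are assumptions.
body = {
    "size": 0,  # only the aggregation buckets are needed, not the hits
    "aggs": {
        "suser": {
            "terms": {"field": "http_user"},
            "aggs": {
                "src": {
                    "terms": {"field": "src"},
                    "aggs": {
                        "max_utime": {"max": {"field": "unix_time"}},
                        "min_utime": {"min": {"field": "unix_time"}},
                    },
                }
            },
        }
    },
}

# response = es_client.search(index="myindex", body=body)
# aggregations = response["aggregations"]
```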

I created a test to validate the solution. I extract the aggregations into a list of dicts (one row per nested bucket, holding each bucket key and metric value):

results_array = []
for suser in aggregations.get("suser").get("buckets"):
    for src in suser.get("src").get("buckets"):
        row = dict(suser=suser.get("key"), src=src.get("key"),
                   min_utime=src.get("min_utime").get("value"),
                   max_utime=src.get("max_utime").get("value"))

        results_array.append(row)

first_row = results_array[0]
http_user = first_row.get('suser')
src = first_row.get('src')
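To make sure the flattening does what I think, here is the same loop run against a hand-made response (the bucket keys and metric values below are made up, not real data):

```python
# Sanity check of the bucket flattening against a hand-made
# aggregation response; all keys and values here are invented.
aggregations = {
    "suser": {"buckets": [
        {"key": "alice", "src": {"buckets": [
            {"key": "10.0.0.1",
             "min_utime": {"value": 1475489579.0},
             "max_utime": {"value": 1475489620.0}},
        ]}},
    ]}
}

results_array = []
for suser in aggregations["suser"]["buckets"]:
    for src in suser["src"]["buckets"]:
        results_array.append(dict(
            suser=suser["key"], src=src["key"],
            min_utime=src["min_utime"]["value"],
            max_utime=src["max_utime"]["value"]))

# results_array == [{'suser': 'alice', 'src': '10.0.0.1',
#                    'min_utime': 1475489579.0, 'max_utime': 1475489620.0}]
```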

new_query = {
    'query': {
        'bool': {
            'must': [
                {'term': {'http_user': http_user}},
                {'term': {'src': src}}
            ],
            'filter': {
                'range': {
                    '@timestamp': {
                        'gte': datetime.datetime(2016, 10, 3, 10, 22, 44, 569755),
                        'lte': datetime.datetime(2016, 10, 3, 14, 22, 44, 569755)
                    }
                }
            }
        }
    }
}
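As far as I can tell, the client serializes the `datetime` bounds to ISO 8601 via `isoformat()` (elasticsearch-py's default JSON serializer), so both queries should see the same range; a quick check of what should reach Elasticsearch:

```python
import datetime

# The range bounds as they should reach Elasticsearch after serialization
# (assuming the default elasticsearch-py serializer, which uses isoformat()).
gte = datetime.datetime(2016, 10, 3, 10, 22, 44, 569755)
lte = datetime.datetime(2016, 10, 3, 14, 22, 44, 569755)
print(gte.isoformat())  # 2016-10-03T10:22:44.569755
print(lte.isoformat())  # 2016-10-03T14:22:44.569755
```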

sort = ["unix_time:asc"]

results = []
for hit in scan(es_client, index="my_index", query=new_query, sort=sort):
    results.append(hit)

first_hit = results[0].get('_source')

self.assertEqual(first_row.get("min_utime"),
                 first_hit.get("unix_time"))

The test randomly succeeds and fails, and the two times are not equal, even though the datetimes I provide are always the same in both the aggregation query and the new query.

Test results:

AssertionError: 1475489579.0 != 1475489620
AssertionError: 1475489961.0 != 1475489980
  • Sounds perfectly good. Have you tried to execute it? – Val Sep 30 '16 at 10:56
  • Yes, it works, though there are so many results back that I haven't figured out a way to verify whether the results are valid. But thanks for letting me know :) – Apostolos Sep 30 '16 at 11:00
  • The easiest way to figure it out is to take one `http_user` and `src`, build a `bool/filter/term` query out of it and then sort by `unix_time` `asc` and then `desc`, and that should match the `min`, respectively `max`, value you're getting for that bucket. – Val Sep 30 '16 at 11:10
  • I tried your solution but I seem to get different numbers. The query: new_query = { 'sort': { "unix_time": "asc" }, 'query': { 'bool': { 'must': [ {"term": {'http_user': http_user}}, {"term": {'src': src}} ] } } }. http_user and src are two values I picked randomly from the same group. But when checking min_utime and unix_time they are not the same, which they should be. – Apostolos Oct 03 '16 at 11:00
  • Please update your question with the numbers you get from your last query and the ones you get in the aggregation query. – Val Oct 03 '16 at 11:02
