I am currently calculating the average for a single dimension in a Druid data source using a timeseries query via pydruid. This is based on an example in the documentation (https://github.com/druid-io/pydruid):

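from pydruid.client import PyDruid
from pydruid.utils.aggregators import count, doublesum
from pydruid.utils.postaggregator import Field  # needed for the post-aggregation below

# Placeholder broker URL and endpoint; replace with your own.
client = PyDruid('http://localhost:8082', 'druid/v2')
client.timeseries(
    datasource='test_datasource',
    granularity='hour',
    intervals='2019-05-13T11:00:00.000/2019-05-23T17:00:00.000',
    aggregations={
        'sum': doublesum('dimension_name'),
        'count': count('rows'),
    },
    post_aggregations={
        # average = sum of the column / number of rows counted
        'average': Field('sum') / Field('count'),
    },
)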

My problem is that I don't know what count('rows') is doing. It seems to return the total row count for the data source rather than a count filtered on the dimension, so I don't know whether the average will be wrong if a row has a null value in the dimension in question.
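To make the concern concrete, here is a minimal plain-Python sketch of the arithmetic (it does not query Druid). The values are made up for illustration, and it assumes a null contributes nothing to the sum but is still counted as a row:

```python
# Hypothetical values for one hourly bucket; None stands in for a null row.
values = [10.0, 20.0, None, 30.0]

# Assumption: the sum ignores the null, so it adds nothing to the total.
total = sum(v for v in values if v is not None)

row_count = len(values)                                   # every row, null or not
non_null_count = sum(1 for v in values if v is not None)  # only rows with a value

avg_over_rows = total / row_count          # denominator includes the null row
avg_over_non_null = total / non_null_count # denominator excludes the null row

print(avg_over_rows, avg_over_non_null)
```

If count('rows') behaves like row_count here, a single null row would pull the average down.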

Does anyone know how to calculate the average correctly?
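One direction that might work is to wrap the count in Druid's "filtered" aggregator so that only rows with a non-null value are counted. The sketch below only builds the native JSON query as a Python dict and has not been run against a Druid cluster. The helper name build_average_query is hypothetical, and whether a selector filter on null behaves as expected for this column depends on the Druid version and its null-handling mode:

```python
import json


def build_average_query(datasource, intervals, column):
    """Build a native Druid timeseries query dict (hypothetical helper)."""
    # Assumption: "not selector value=null" matches rows where the column is set.
    not_null = {
        "type": "not",
        "field": {"type": "selector", "dimension": column, "value": None},
    }
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": "hour",
        "intervals": [intervals],
        "aggregations": [
            {"type": "doubleSum", "name": "sum", "fieldName": column},
            # Count only the rows that pass the not-null filter.
            {
                "type": "filtered",
                "filter": not_null,
                "aggregator": {"type": "count", "name": "count"},
            },
        ],
        "postAggregations": [
            {
                "type": "arithmetic",
                "name": "average",
                "fn": "/",
                "fields": [
                    {"type": "fieldAccess", "fieldName": "sum"},
                    {"type": "fieldAccess", "fieldName": "count"},
                ],
            }
        ],
    }


query = build_average_query(
    "test_datasource",
    "2019-05-13T11:00:00.000/2019-05-23T17:00:00.000",
    "dimension_name",
)
print(json.dumps(query, indent=2))
```

The resulting dict could be posted to the broker as JSON. The same shape may be expressible through pydruid's own helpers, but that is not shown here.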

Many thanks

Huw