
Can someone explain or link to an explanation about how counting the cardinality of a set with HLL can be used for time series analysis?

I'm pretty sure druid.io does exactly this, but I'm looking for a general explanation of how to do this with HLL alone, without any specific library, database, or HLL implementation.

A naive way of doing this would be to keep one HLL per second and per event, prefixing the timestamp onto the key, and adding the distinct IDs you want to count (say, user IDs) to it. E.g., using the Redis HLL API, if you are counting events starting from second 1000001 up to second 1000060:

PFADD "1000001-event1" "user1"
PFADD "1000001-event2" "user1"
PFADD "1000002-event1" "user2"
PFADD "1000002-event3" "user2"
PFADD "1000003-event2" "user3"
PFADD "1000003-event3" "user3"

# Get count of occurrences of event1 in a minute-long range:
PFCOUNT "1000001-event1" -> 1    
PFCOUNT "1000002-event1" -> 1   
PFCOUNT "10000..-event1" -> ..   
PFCOUNT "1000060-event1" -> 0    
...add all numbers!      -> 2

Just one of the problems with this is that you would need to iterate through every second in a given range to find out, say, the count of a specific event in the last minute.
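
For reference, here is a minimal sketch of that naive loop with redis-py (the per-second key naming above and the local Redis instance are just assumptions of this example). It also shows that, instead of summing the per-second counts (which over-counts IDs that show up in more than one second), all the keys can be passed to a single PFCOUNT, which returns the cardinality of their union:

import redis

r = redis.Redis()  # assumes a Redis instance on localhost:6379

# The per-second keys from the example above, covering one minute of event1.
keys = ["%d-event1" % ts for ts in range(1000001, 1000061)]

# Naive approach: one PFCOUNT per second, then add the numbers up.
# This takes 60 round trips and over-counts IDs seen in several seconds.
total = sum(r.pfcount(k) for k in keys)

# PFCOUNT also accepts several keys and returns the cardinality of their
# union, so the distinct count for the whole minute is a single call.
distinct = r.pfcount(*keys)

print(total, distinct)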

1 Answer


Using the hyperUnique aggregator in Druid requires a bit of coordination between the ingestion side and the query side.

On the ingestion side, in your list of aggregators, you need to include a "hyperUnique" aggregator whose fieldName matches the dimension you eventually want to run unique counts over. This creates a new metric that contains HLL "sketches". When your data is ingested and queryable, you use the same "hyperUnique" aggregator on the query side to query the metric you ingested. You can try this out with a timeseries query (http://druid.io/docs/latest/TimeseriesQuery.html).
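
For concreteness, here is a rough sketch of what the two sides might look like, expressed as Python dicts; the datasource, dimension, and metric names ("events", "user_id", "unique_users") and the broker address are made-up placeholders, not anything from the question:

import json
import requests

# Ingestion side: a "hyperUnique" aggregator whose fieldName is the dimension
# you want unique counts over ("user_id" is an assumed name for this example).
ingestion_aggregators = [
    {"type": "count", "name": "count"},
    {"type": "hyperUnique", "name": "unique_users", "fieldName": "user_id"},
]

# Query side: a timeseries query that references the ingested HLL metric with
# the same "hyperUnique" aggregator type.
query = {
    "queryType": "timeseries",
    "dataSource": "events",                 # assumed datasource name
    "granularity": "minute",
    "intervals": ["2014-04-08T00:00/2014-04-09T00:00"],
    "aggregations": [
        {"type": "hyperUnique", "name": "unique_users", "fieldName": "unique_users"}
    ],
}

# Assumes a Druid broker listening on localhost:8082; adjust for your cluster.
resp = requests.post(
    "http://localhost:8082/druid/v2/",
    data=json.dumps(query),
    headers={"Content-Type": "application/json"},
)
print(resp.json())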

BTW, check out groups.google.com/forum/#!forum/druid-development for further questions about HLL and Druid.

  • I'm trying to understand how to implement a time series aggregation using HLL in general, but I mentioned Druid because it is an example of a project that does this. I'm looking for a general explanation of how to do this with HLL alone, without any specific library or database. – Emmanuel Oga Apr 08 '14 at 21:47
  • These resources might help: https://www.youtube.com/watch?v=Hpd3f_MLdXo and http://metamarkets.com/2012/fast-cheap-and-98-right-cardinality-estimation-for-big-data/ – user3512891 Apr 08 '14 at 22:02
  • Ah! If I get it right from that video, you guys **do** store an HLL sketch **per record** (so if the granularity of the data is 1 second, you have 1 HLL sketch per second). You manage to deal with the storage requirements by splitting the storage across multiple partitions using some sharded storage solution (like S3). – Emmanuel Oga Apr 08 '14 at 23:03