
In my Java daemon application, I read events from a Kafka topic with more than 100 partitions across multiple servers (with the high-level consumer group). I need to aggregate the event count per minute per event name and flush that to a time series database. Note that event timestamps can be out of order and can lag far behind the consumer's current time. The event format is as follows:

Timestamp (epoch ms; shown as yyyy/MM/dd HH:mm:ss for readability)   Event   Count
2015/01/01 00:03:35                                                  E2      100
2015/01/01 00:01:35                                                  E1      200
2015/01/01 00:00:35                                                  E2      300
2015/01/01 00:01:27                                                  E2      700
2015/01/01 00:00:23                                                  E2      400
2015/01/01 00:00:30                                                  E1      500
2015/01/01 00:00:50                                                  E1      600

I have to pre-aggregate the counts before writing them to the storage engine (the counts could be stored in any time series database).

I would store the following aggregates in the storage engine (timestamp floored to the minute):

2015/01/01 00:03:00 E2  100
2015/01/01 00:01:00 E1  200
2015/01/01 00:01:00 E2  700
2015/01/01 00:00:00 E1  1100
2015/01/01 00:00:00 E2  700
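
For illustration, the minute bucketing above can be sketched in plain Java. This just reproduces the expected table from the sample events; it is not the streaming solution:

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.time.temporal.ChronoUnit;
import java.util.Map;
import java.util.TreeMap;

public class MinuteAggregation {
    private static final DateTimeFormatter FMT =
            DateTimeFormatter.ofPattern("yyyy/MM/dd HH:mm:ss");

    /** Sum counts per (minute bucket, event name); map key is "minute|event". */
    static Map<String, Long> aggregate(String[][] events) {
        Map<String, Long> buckets = new TreeMap<>();
        for (String[] e : events) {
            // Floor the timestamp to the minute, then accumulate per event name.
            LocalDateTime minute = LocalDateTime.parse(e[0], FMT)
                    .truncatedTo(ChronoUnit.MINUTES);
            buckets.merge(minute.format(FMT) + "|" + e[1],
                    Long.parseLong(e[2]), Long::sum);
        }
        return buckets;
    }

    public static void main(String[] args) {
        String[][] events = {
                {"2015/01/01 00:03:35", "E2", "100"},
                {"2015/01/01 00:01:35", "E1", "200"},
                {"2015/01/01 00:00:35", "E2", "300"},
                {"2015/01/01 00:01:27", "E2", "700"},
                {"2015/01/01 00:00:23", "E2", "400"},
                {"2015/01/01 00:00:30", "E1", "500"},
                {"2015/01/01 00:00:50", "E1", "600"},
        };
        aggregate(events).forEach((k, v) -> System.out.println(k + "  " + v));
    }
}
```

Because arrival order is irrelevant to the sum, out-of-order events fall into the correct bucket automatically; the hard part is only deciding when a bucket is final enough to flush.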

I have evaluated Coda Hale's Metrics library and also StatsD (Graphite/collectd are not options), but the problem with all of those libraries is that they aggregate events in real time, which is not possible here. So I was thinking about using an LRU ConcurrentHashMap as the data structure to hold the counts, and flushing the map to storage every minute. I would also have to keep entries in the LRU structure for an hour or so, because data can be late due to consumer lag or out-of-order timestamps.
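
The map-based approach described above can be sketched roughly as follows. This is only an illustration of the idea under stated assumptions, not a library recommendation; the `Store` interface is a hypothetical stand-in for whatever time series database client is used:

```java
import java.util.Iterator;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.LongAdder;

public class WindowedCounter {
    // Keep buckets around for an hour so late/out-of-order events still land.
    private static final long RETENTION_MS = TimeUnit.HOURS.toMillis(1);

    // Key: "<minuteEpochMillis>|<eventName>", value: running total for that bucket.
    private final ConcurrentHashMap<String, LongAdder> counts = new ConcurrentHashMap<>();

    /** Called from the Kafka consumer threads; lock-free on the hot path. */
    public void record(long eventTimestampMs, String event, long count) {
        long minute = eventTimestampMs - (eventTimestampMs % 60_000L);
        counts.computeIfAbsent(minute + "|" + event, k -> new LongAdder()).add(count);
    }

    /** Flush buckets whose minute fell out of the retention window, then drop them. */
    public void flushExpired(long nowMs, Store store) {
        long cutoff = nowMs - RETENTION_MS;
        for (Iterator<Map.Entry<String, LongAdder>> it =
                counts.entrySet().iterator(); it.hasNext(); ) {
            Map.Entry<String, LongAdder> e = it.next();
            String[] parts = e.getKey().split("\\|", 2);
            long minuteMs = Long.parseLong(parts[0]);
            if (minuteMs < cutoff) {
                store.write(minuteMs, parts[1], e.getValue().sum());
                // Caveat: a record() racing between sum() and remove() loses its count.
                it.remove();
            }
        }
    }

    /** Hypothetical stand-in for the time series DB client. */
    public interface Store {
        void write(long minuteMillis, String event, long total);
    }
}
```

A `ScheduledExecutorService` could call `flushExpired` once a minute. Note the race in the comment: to avoid losing counts on the sum/remove boundary, a real implementation would either tolerate it (counts are approximate) or flush buckets idempotently with upserts into the store instead of removing and re-counting.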

Do you know of any open source library that does this, or a better approach to aggregate and flush?

    Do you know how much drift exists? Systems like this usually have a normal distribution centered along the "real" time. 90% of the stuff usually ends up there within +-10min of the current time. So your cache will work nicely. Also how expensive is a lookup+write to your time series DB? Any persistence guarantees you need (what happens if the server goes down before the flush)? – Thomas Jungblut Jan 19 '15 at 23:04
  • @ThomasJungblut, for now that is not a big concern, since I can handle that failure by buffering on disk when the DB is down. In terms of drift (I am assuming it could be as big as 8 hrs), I have seen it happen because people have configured the system clock with a different timezone (e.g. UTC vs PST), but I do not care about that drift; I only care about storing +-10 hrs (60 data points in the top cache). Assuming I have no RAM limitation, that is fine. I just need a library that considers the event timestamp for counting events. I do not see any so far. If you have any reference, please let me know. – Bmis13 Jan 19 '15 at 23:20
  • Please let me know any open source lib that does this (if any). – Bmis13 Jan 23 '15 at 06:50
