In my Java daemon application, I read events from a Kafka topic with more than 100 partitions across multiple servers (using the High Level Consumer Group). I need to aggregate the event count per minute per event name and flush the result to a time-series database. Note that event timestamps can arrive out of order and can lag far behind the consumer's current time. The event format is as follows:
Timestamp (in ms; shown as text in yyyy/MM/dd HH:mm:ss format for readability), event name, event count:
2015/01/01 00:03:35 E2 100
2015/01/01 00:01:35 E1 200
2015/01/01 00:00:35 E2 300
2015/01/01 00:01:27 E2 700
2015/01/01 00:00:23 E2 400
2015/01/01 00:00:30 E1 500
2015/01/01 00:00:50 E1 600
I have to do the pre-aggregation before writing to the storage engine (the counts could be stored in any time-series database).
I would store the following aggregates in the storage engine (timestamps floored to the minute):
2015/01/01 00:03:00 E2 100
2015/01/01 00:01:00 E1 200
2015/01/01 00:01:00 E2 700
2015/01/01 00:00:00 E1 1100
2015/01/01 00:00:00 E2 700
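The flooring step above (300 + 400 = 700 for E2 in the 00:00 minute) can be sketched in plain Java. This is a minimal illustration, not a library API; the class and method names are my own, and the composite string key is just one simple way to combine minute and event name:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch: floor each event timestamp to its minute bucket and
// sum the counts per (minute, event name) pair.
public class MinuteAggregator {
    // Key: minuteMillis + "|" + eventName; value: running count.
    private final Map<String, Long> counts = new LinkedHashMap<>();

    public void add(long timestampMs, String eventName, long count) {
        long minuteMs = (timestampMs / 60_000L) * 60_000L; // floor to minute
        counts.merge(minuteMs + "|" + eventName, count, Long::sum);
    }

    public Map<String, Long> snapshot() {
        return counts;
    }
}
```

Feeding the seven sample events through this produces exactly the five aggregated rows shown above.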
I have evaluated Coda Hale (Dropwizard) Metrics and also StatsD (Graphite/collectd is not an option), but the problem with all those libraries is that they aggregate events in real time, which is not possible here. So I was thinking about using an LRU ConcurrentHashMap as the data structure to hold the counts and flushing this map to storage every minute. I would also have to keep the LRU structure intact for an hour or so, because data can arrive late due to consumer lag or out-of-order delivery.
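To make the "keep buckets around for late data" idea concrete, here is a rough sketch of the structure I have in mind, using standard java.util.concurrent classes only. The class name, the grace-period parameter, and the flush method are all hypothetical; a real implementation would also need to bound memory and handle events older than the grace period:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative sketch (not a library API): per-minute buckets that stay
// open for a grace period, so late/out-of-order events still land in
// the correct bucket before it is flushed to the time-series store.
public class LateTolerantAggregator {
    private final Map<Long, ConcurrentHashMap<String, LongAdder>> buckets =
            new ConcurrentHashMap<>();
    private final long graceMs;

    public LateTolerantAggregator(long graceMs) {
        this.graceMs = graceMs; // e.g. 3_600_000 for a one-hour grace period
    }

    public void record(long eventTsMs, String name, long count) {
        long minute = (eventTsMs / 60_000L) * 60_000L; // floor to minute
        buckets.computeIfAbsent(minute, m -> new ConcurrentHashMap<>())
               .computeIfAbsent(name, n -> new LongAdder())
               .add(count);
    }

    // Remove and return every bucket older than (nowMs - graceMs); the
    // caller writes the returned totals to the time-series database.
    public Map<Long, Map<String, Long>> flushClosed(long nowMs) {
        Map<Long, Map<String, Long>> out = new TreeMap<>();
        for (Long minute : buckets.keySet()) {
            if (minute < nowMs - graceMs) {
                Map<String, LongAdder> bucket = buckets.remove(minute);
                if (bucket != null) {
                    Map<String, Long> totals = new HashMap<>();
                    bucket.forEach((name, adder) -> totals.put(name, adder.sum()));
                    out.put(minute, totals);
                }
            }
        }
        return out;
    }
}
```

A scheduled executor could call flushClosed once a minute; any event that arrives within the grace period is still merged into its original minute bucket rather than being miscounted in the current one.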
Do you know of any open-source library that does this, or a better approach to aggregate and flush?