I'm working on a project that should write data from Kafka to HDFS. Suppose there is an online server that writes messages into Kafka, and each message includes a timestamp. I want to create a job whose output is a file (or files) named according to the timestamps in the messages. For example, if the data in Kafka is
{"ts":"01-07-2013 15:25:35.994", "data": ...}
...
{"ts":"01-07-2013 16:25:35.994", "data": ...}
...
{"ts":"01-07-2013 17:25:35.994", "data": ...}
I would like to get these 3 files as output:
kafka_file_2013-07-01_15.json
kafka_file_2013-07-01_16.json
kafka_file_2013-07-01_17.json
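To make the mapping concrete, here is a minimal sketch of the timestamp-to-filename logic I have in mind. It assumes the `ts` field uses a day-month-year format, as the examples above suggest (`01-07-2013` becomes `2013-07-01`); the function name is mine, not from any library:

```python
from datetime import datetime

def hourly_filename(ts):
    """Map a message timestamp (DD-MM-YYYY HH:MM:SS.mmm, as in the
    example messages) to the hourly output file it belongs in."""
    dt = datetime.strptime(ts, "%d-%m-%Y %H:%M:%S.%f")
    return dt.strftime("kafka_file_%Y-%m-%d_%H.json")

print(hourly_filename("01-07-2013 15:25:35.994"))
# kafka_file_2013-07-01_15.json
```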
And of course, if I run the job again and there are new messages in the queue, like
{"ts":"01-07-2013 17:25:35.994", "data": ...}
it should create the file
kafka_file_2013-07-01_17_2.json // second chunk of hour 17
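The chunk-numbering behavior I want could be sketched like this: given the files already written for an hour, pick the next free `_N` suffix. This is only an illustration of the desired naming scheme (the `existing` set stands in for a real HDFS directory listing, and the function name is hypothetical):

```python
def chunked_filename(base, existing):
    """Pick the next free chunk name for an hourly file: the plain name
    for the first run, then _2, _3, ... suffixes on later runs.
    `existing` stands in for a directory listing (e.g. from HDFS)."""
    if base not in existing:
        return base
    stem, ext = base.rsplit(".", 1)
    n = 2
    while "%s_%d.%s" % (stem, n, ext) in existing:
        n += 1
    return "%s_%d.%s" % (stem, n, ext)
```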
I've seen some open-source projects, but most of them just read from Kafka into a single HDFS folder without partitioning by message timestamp. What is the best solution/design/open-source project for this problem?