Use Case
Persisting Kafka messages to S3 using Apache Storm
Story so far
- I tried Secor (https://github.com/pinterest/secor). It works fine and serves the purpose, but according to my manager (who, as they say, is always right) it may be too much of a maintenance burden.
- We already have a stable Apache Kafka and Apache Storm cluster in place, so we plan to leverage that infrastructure.
Agenda and Problem
Messages from Kafka will be batched in a Storm bolt and written to a file on local disk.
Once a time interval and/or file size threshold is reached, the file will be uploaded to S3.
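To make the interval/size criteria concrete, here is a minimal sketch of the rotation check I have in mind. The class name `RotationPolicy` and its methods are hypothetical (they come from neither Storm nor Secor), and the thresholds are placeholders. The bolt would call `record` for every message appended to the local file and upload the file whenever `shouldRotate` returns true.

```java
import java.time.Duration;

/** Hypothetical helper: decides when the local batch file should be closed and uploaded. */
final class RotationPolicy {
    private final long maxBytes;
    private final long maxAgeMillis;
    private long bytesWritten;
    private long openedAtMillis;

    RotationPolicy(long maxBytes, Duration maxAge, long nowMillis) {
        this.maxBytes = maxBytes;
        this.maxAgeMillis = maxAge.toMillis();
        this.openedAtMillis = nowMillis;
    }

    /** Call once per message appended to the current file. */
    void record(int messageBytes) {
        bytesWritten += messageBytes;
    }

    /** True when the file is non-empty and has hit the size or the age limit. */
    boolean shouldRotate(long nowMillis) {
        boolean tooBig = bytesWritten >= maxBytes;
        boolean tooOld = nowMillis - openedAtMillis >= maxAgeMillis;
        return bytesWritten > 0 && (tooBig || tooOld);
    }

    /** Call after a successful upload, when a new local file is opened. */
    void reset(long nowMillis) {
        bytesWritten = 0;
        openedAtMillis = nowMillis;
    }
}
```

The clock is passed in as a parameter rather than read inside the class, so the age check could also be driven by a periodic timer in the bolt instead of only by incoming messages.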
To handle failures, each bolt should be able to keep track of the Kafka partition and offset, ideally per tuple, since bolts will be distributed randomly across the cluster.
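Roughly, the bookkeeping I picture inside each bolt looks like the sketch below. `OffsetTracker` is a hypothetical name, and the sketch assumes the partition and offset are already available for every message, which is exactly the part I have not solved. It also assumes that all messages of a given partition reach the same bolt instance in order. Otherwise the highest offset seen is not a safe point to resume from.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: tracks the highest offset per Kafka partition inside one bolt. */
final class OffsetTracker {
    // Offsets written to the current local file but not yet uploaded to S3.
    private final Map<Integer, Long> pending = new HashMap<>();
    // Offsets covered by files that were uploaded successfully.
    private final Map<Integer, Long> uploaded = new HashMap<>();

    /** Call for every message appended to the local file. */
    void record(int partition, long offset) {
        pending.merge(partition, offset, Long::max);
    }

    /** Call after a successful upload; returns the offsets that could be persisted to ZooKeeper. */
    Map<Integer, Long> markUploaded() {
        pending.forEach((partition, offset) -> uploaded.merge(partition, offset, Long::max));
        pending.clear();
        return Collections.unmodifiableMap(new HashMap<>(uploaded));
    }

    /** Call when an upload fails and the local file is thrown away. */
    void discardPending() {
        pending.clear();
    }
}
```

On restart, a bolt would read the persisted map back and treat any message at or below the stored offset for its partition as already uploaded.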
Partitions and offsets can be persisted to ZooKeeper, but how do I obtain them from a Tuple inside a bolt in the first place? Is there any way other than forwarding them to the bolt from the Kafka spout?