
Use Case

Persisting Kafka messages to S3 using Apache Storm

Story so far

  • I tried Secor (https://github.com/pinterest/secor); it works fine and serves the purpose, but it may be too much of a maintenance burden according to my manager (who, as they say, is always right).
  • We already have a stable Apache Kafka / Apache Storm cluster in place, so we plan to leverage that infrastructure.

Agenda and Problem

  • Messages from Kafka will be batched in a Storm bolt and written to a file on local disk.

  • After a certain interval and/or size threshold is reached, the file will be uploaded to S3.

  • To manage failures, each bolt should keep track of the Kafka partition and offset, ideally per tuple, since bolts will be distributed randomly across the cluster (a rough sketch of such a bolt follows this list).

  • The partition/offset can be persisted to ZooKeeper, but in the first place, how do I obtain them from a Tuple in a bolt? Is there any way other than forwarding them to the bolt from the Kafka spout?
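
For concreteness, here is a minimal sketch of the batching bolt described above, assuming the spout forwards the message together with "partition" and "offset" fields; the field names, file path, and thresholds are illustrative assumptions, not settled choices:

    import java.io.BufferedWriter;
    import java.io.FileWriter;
    import java.io.IOException;
    import java.util.Map;

    import org.apache.storm.task.OutputCollector;
    import org.apache.storm.task.TopologyContext;
    import org.apache.storm.topology.OutputFieldsDeclarer;
    import org.apache.storm.topology.base.BaseRichBolt;
    import org.apache.storm.tuple.Tuple;

    public class S3BatchBolt extends BaseRichBolt {
        private transient BufferedWriter writer;
        private transient OutputCollector collector;
        private long bytesWritten;
        private long lastFlushMillis;
        private long lastOffsetWritten = -1L;

        @Override
        public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
            this.collector = collector;
            this.lastFlushMillis = System.currentTimeMillis();
            try {
                // Local batch file; the path is a placeholder.
                this.writer = new BufferedWriter(new FileWriter("/tmp/kafka-batch.tmp", true));
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        @Override
        public void execute(Tuple tuple) {
            try {
                String msg = tuple.getStringByField("str");           // assumed field names,
                int partition = tuple.getIntegerByField("partition"); // as emitted by a
                long offset = tuple.getLongByField("offset");         // metadata-aware scheme

                writer.write(msg);
                writer.newLine();
                bytesWritten += msg.length();
                lastOffsetWritten = offset;
                collector.ack(tuple);

                // Upload when either the size or the time threshold is crossed (arbitrary values).
                boolean sizeHit = bytesWritten >= 64L * 1024 * 1024;                        // 64 MB
                boolean timeHit = System.currentTimeMillis() - lastFlushMillis >= 300_000L; // 5 min
                if (sizeHit || timeHit) {
                    writer.flush();
                    uploadToS3("/tmp/kafka-batch.tmp", partition, lastOffsetWritten);
                    bytesWritten = 0;
                    lastFlushMillis = System.currentTimeMillis();
                }
            } catch (IOException e) {
                collector.fail(tuple);
            }
        }

        private void uploadToS3(String path, int partition, long lastOffset) {
            // Placeholder: push the file to S3 (e.g. via the AWS SDK) and, if needed,
            // record "last offset uploaded" for this partition in ZooKeeper.
        }

        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
            // Terminal bolt: nothing is emitted downstream.
        }
    }

Note that acking each tuple as soon as it lands in the local file only ties the at-least-once guarantee to the local write; to tie it to the S3 upload, the acks would have to be deferred until the upload succeeds.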

Albatross

2 Answers


The Kafka spout already tracks topic offsets in ZooKeeper, so you don't need to implement this logic in a bolt.

The Kafka spout emits a tuple and the topology tracks it. Once the tuple has been acknowledged by every bolt it passed through, the spout considers it delivered. Alongside emitting tuples, the spout records the current offset in ZooKeeper, so if something goes wrong you can resume reading messages instead of starting from the beginning.

The topology described above guarantees at-least-once delivery. With a Trident topology you can guarantee exactly-once delivery. In both cases, look at the topology.max.spout.pending setting; it's crucial to set it right because you are going to use batching.
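
As a rough illustration of that last point, the cap can be set on the topology Config before submission; the values below are arbitrary examples:

    import org.apache.storm.Config;

    public class TopologyConfigSketch {
        public static Config batchingFriendlyConfig() {
            Config conf = new Config();
            conf.setMaxSpoutPending(5000);   // topology.max.spout.pending; example value only
            conf.setMessageTimeoutSecs(120); // leave time for batching + S3 upload before replays
            return conf;
        }
    }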

f1sherox
  • Yes, the spout will track the offset for sure. But while the file is being written, there is a need to track the "last offset written" and the "last offset uploaded". Say the current offset as per the spout is Coff; at any given time Coff >= last offset written >= last offset uploaded. Tracking is needed for the last two, which are not always the same as Coff. – Albatross May 05 '16 at 21:25
  • Which message delivery guarantee do you want: at least once or exactly once? – f1sherox May 06 '16 at 15:57
  • At least once should be ok. – Albatross May 06 '16 at 17:06
  • So if you are OK with the at-least-once guarantee, why do you want to track the "last offset written"? – f1sherox May 10 '16 at 10:28
  • OK, let's not track that. What about solving the rest of the problem? – Albatross May 12 '16 at 18:55

Configure the KafkaSpout with org.apache.storm.kafka.StringMessageAndMetadataScheme, which will add the offset and partition to the spout's emitted values.
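
A minimal wiring sketch of this, assuming the storm-kafka SpoutConfig/ZkHosts API; the MessageMetadataSchemeAsMultiScheme wrapper and the declared field names ("str", "partition", "offset") are from memory of storm-kafka 1.x, so verify them against your version, and the connect string, topic, and id are placeholders:

    import org.apache.storm.kafka.KafkaSpout;
    import org.apache.storm.kafka.MessageMetadataSchemeAsMultiScheme;
    import org.apache.storm.kafka.SpoutConfig;
    import org.apache.storm.kafka.StringMessageAndMetadataScheme;
    import org.apache.storm.kafka.ZkHosts;
    import org.apache.storm.tuple.Tuple;

    public class KafkaSpoutWiringSketch {
        public static KafkaSpout buildSpout() {
            ZkHosts hosts = new ZkHosts("zk1:2181,zk2:2181");  // placeholder ZooKeeper connect string
            SpoutConfig cfg = new SpoutConfig(hosts, "my-topic", "/kafka-s3", "kafka-s3-spout");
            // Emit (message, partition, offset) instead of just the message string.
            cfg.scheme = new MessageMetadataSchemeAsMultiScheme(new StringMessageAndMetadataScheme());
            return new KafkaSpout(cfg);
        }

        // Downstream, the bolt can then read the metadata by field name:
        static void readMetadata(Tuple tuple) {
            String message = tuple.getStringByField("str");
            int partition = tuple.getIntegerByField("partition");
            long offset = tuple.getLongByField("offset");
            // ... append to the batch file and remember the offset as needed
        }
    }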

sodwyer