
I'm writing a Spark Streaming application that reads from Kafka. In order to have exactly-once semantics, I'd like to use the direct Kafka stream and Spark Streaming's native checkpointing.

The problem is that checkpointing makes it practically impossible to maintain the code: if you change something, you lose the checkpointed data, and thus you are almost compelled to read some messages from Kafka twice. And I'd like to avoid that.

Thus, I was trying to read the data in the checkpointing directory by myself, but so far I haven't been able to do that. Can someone tell me how to read the information about the last processed Kafka offsets from the checkpointing folder?

Thank you, Marco

  • Are you using stateful streams? If not, you don't *have to* use checkpointing in your graph, you can simply store the Kafka Offsets. – Yuval Itzchakov Sep 16 '16 at 10:16
  • Yes, I have a state to maintain... – mgaido Sep 16 '16 at 10:18
  • Getting exactly once with stateful streams is tricky. One thing you can do is make sure to serialize the state yourself with a protocol that supports schema evolution, but that would cost you an additional serialization on top of checkpointing the data, which isn't that scalable. – Yuval Itzchakov Sep 16 '16 at 10:20
  • I know. This is the reason why I'm trying to read the information from the checkpoint data stored by Spark... – mgaido Sep 16 '16 at 10:40
  • Spark stores a `ReliableCheckpointRDD` inside your checkpoint directory, not the raw bytes of your state. It's not meant to be externally read. – Yuval Itzchakov Sep 16 '16 at 10:43
  • Thus, it is not feasible to read it... so I can't do what I was trying... thank you... – mgaido Sep 16 '16 at 10:47
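Following the suggestion in the first comment, one alternative to Spark's opaque checkpoint format is to persist the processed offsets yourself in a format you control. The sketch below is a minimal, framework-free illustration of that idea: all names (`save_offsets`, `load_offsets`, the JSON file layout) are hypothetical, not part of any Spark or Kafka API; a production job would more likely store offsets in ZooKeeper, a database, or Kafka itself.

```python
import json
import os
import tempfile

# Hypothetical helpers: persist the last processed Kafka offsets as a small
# JSON map {"topic:partition": offset} instead of relying on Spark's
# checkpoint directory, which is not meant to be read externally.

def save_offsets(path, offsets):
    """Atomically write the offsets map to `path`."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".")
    with os.fdopen(fd, "w") as f:
        json.dump(offsets, f)
    os.replace(tmp, path)  # atomic rename: readers never see a partial file

def load_offsets(path):
    """Return the stored offsets map, or {} on the very first run."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)
```

In a direct-stream job you would call something like `save_offsets()` at the end of each successfully processed batch (after the state update is committed) and feed the result of `load_offsets()` back as the starting offsets when (re)creating the stream after a code change, so no messages are re-read.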

0 Answers