Spark Streaming provides two kinds of streams when integrating with Kafka:

  1. Receiver Based
  2. Direct

Which kind of stream does Structured Streaming use when we call `spark.readStream.format("kafka")`?
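
For reference, a minimal PySpark sketch of the call in question. It assumes an existing `SparkSession` with the Kafka source package available; `read_kafka_stream` is a hypothetical helper name, and the broker address and topic passed to it are placeholders.

```python
def read_kafka_stream(spark, bootstrap_servers, topic):
    """Build a streaming DataFrame from a Kafka topic (hypothetical helper)."""
    return (
        spark.readStream
        .format("kafka")
        # Placeholder broker list, e.g. "localhost:9092"
        .option("kafka.bootstrap.servers", bootstrap_servers)
        # Subscribe to a single topic by name
        .option("subscribe", topic)
        # Only consulted when the query starts without prior progress
        .option("startingOffsets", "earliest")
        .load()
    )
```

A call would look like `df = read_kafka_stream(spark, "localhost:9092", "events")`, where both arguments are made-up values.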

Abhinav Kumar
  • Neither? At least, not something to be directly concerned with – OneCricketeer Mar 22 '22 at 14:37
  • We need to know, right? For receiver-based streams we need to set WAL settings, and for direct streams we need to set up checkpointing for offset maintenance. Correct me if I am wrong here. – Abhinav Kumar Mar 22 '22 at 14:59
  • Structured Streaming doesn't expose any WAL settings that I'm aware of, and it can store offsets back into Kafka itself. Checkpoints are maintained from the Dataframe api – OneCricketeer Mar 22 '22 at 15:05
  • If it maintains offsets in Kafka itself, then what is the need for checkpointing in Structured Streaming? – Abhinav Kumar Mar 22 '22 at 15:55
  • Checkpoints are for Spark executor state management for exactly-once processing since by default, Kafka is at-least-once delivery. For example, if an executor fails, but then is retried on another node, then a checkpoint is used to recover the state – OneCricketeer Mar 22 '22 at 17:30
  • I think exactly-once processing and offset maintenance are the same thing. Offsets do the same job: the same data should not be processed twice. Also, I can clearly see that under my checkpointing folder Spark has created an offset folder. – Abhinav Kumar Mar 22 '22 at 17:38
  • Not exactly. Say you commit offset 100 to Kafka. The reader is configured to read 100 more. It starts at offset 100 and reads 10 messages, and fails, but makes a checkpoint; it fails, so doesn't commit offset 110 back to Kafka. The checkpoint does have that, though. – OneCricketeer Mar 22 '22 at 17:43
  • OK, got it. One last thing before we close this discussion: where does Spark Structured Streaming with Kafka store its message offsets? Is it Kafka's consumer offsets topic or the checkpoint location? – Abhinav Kumar Mar 22 '22 at 17:50
  • Could be both if checkpoints are enabled. – OneCricketeer Mar 22 '22 at 17:54
  • And if checkpointing is not enabled (though the default is true), then by default to Kafka only? – Abhinav Kumar Mar 22 '22 at 17:58
  • That's correct. `groupIdPrefix` or `kafka.group.id` is used to create a consumer group where offsets are stored – OneCricketeer Mar 22 '22 at 18:43
  • Does Spark Streaming use the checkpoint only when there is a failover? Otherwise, does it use Kafka's consumer_offset for offset management? – Abhinav Kumar Mar 23 '22 at 12:05
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/243255/discussion-between-abhinav-kumar-and-onecricketeer). – Abhinav Kumar Mar 23 '22 at 20:12
  • If you set `.option("checkpointLocation", ...)`, then that path is updated at runtime, and read whenever the executors restart or scale, yes. Kafka is an implementation detail as any structured stream can use checkpoints. – OneCricketeer Mar 23 '22 at 22:17
  • My question is: when I read data from Kafka topics into Spark Streaming, how are offsets managed? Does Spark Streaming (being the consumer here) maintain and check offsets in the checkpoint location, or does it check offsets in the Kafka consumer_offset topic? – Abhinav Kumar Mar 24 '22 at 13:07
  • I've already answered that for Structured Streaming. By default, offsets get committed back to that topic, yes. Checkpoints can be enabled as well for more fault tolerance than what Kafka natively provides – OneCricketeer Mar 24 '22 at 13:26
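
The `checkpointLocation` option mentioned in the comments is set on the writer side of a query. A minimal sketch, assuming `df` is a streaming DataFrame read from Kafka; `start_with_checkpoint` is a hypothetical helper name, and the console sink is chosen only to keep the example short.

```python
def start_with_checkpoint(df, checkpoint_dir):
    """Start a streaming query that records its progress under checkpoint_dir
    (hypothetical helper; df is assumed to be a streaming DataFrame)."""
    return (
        df.writeStream
        .format("console")  # assumption: print batches instead of a real sink
        # Directory where Spark keeps this query's progress information
        .option("checkpointLocation", checkpoint_dir)
        .start()
    )
```

For example, `query = start_with_checkpoint(df, "/tmp/checkpoints/kafka-demo")`, where the path is a placeholder. The offset folder noted in the comments would appear under that directory.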
