I had an issue with a Spark Structured Streaming (SSS) application that crashed due to a program bug and did not process anything over the weekend. When I restarted it, there were many messages on the topics to reprocess (about 250'000 messages on each of the 3 topics, which need to be joined).
On restart, the application crashed again with an OutOfMemory exception. I learned from the docs that the maxOffsetsPerTrigger option on the read stream is supposed to help in exactly these cases. I changed the PySpark code (running on SSS 2.4.3, by the way) to something like the following for all 3 topics:
rawstream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
    .option("subscribe", topicName)
    # limit how many offsets are read per micro-batch
    .option("maxOffsetsPerTrigger", 10000)
    .option("startingOffsets", "earliest")
    .load())
My expectation would be that the SSS query would now load at most ~10'000 offsets from each of the topics and join them in the first batch. In the second batch it would clean up the state records from the first batch which are subject to expiration due to the watermark (which would clean up most of the records from the first batch) and then read the next ~10k from each topic. So after roughly 25 batches it should have worked through the lag with a "reasonable" amount of memory.
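For context, the three streams are joined roughly like this (a simplified sketch; the key and timestamp column names and the 10-minute intervals are placeholders, not my exact code):

from pyspark.sql import functions as F

# sketch of the stream-stream join with watermarks (placeholder names/intervals)
stream_a = rawstream_a.withWatermark("ts_a", "10 minutes")
stream_b = rawstream_b.withWatermark("ts_b", "10 minutes")

joined = stream_a.join(
    stream_b,
    F.expr("""
        key_a = key_b AND
        ts_b >= ts_a AND
        ts_b <= ts_a + interval 10 minutes
    """))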
But the application still kept crashing with an OOM, and when I checked the DAG in the application master UI, it showed that Spark again tried to read all 250'000 messages in a single batch.
Is there something more that I need to configure? How can I check that this option is really being used? When I inspect the plan, it is unfortunately truncated and just shows (Options: [includeTimestamp=true,subscribe=IN2,inferSchema=true,failOnDataLoss=false,kafka.b...); I couldn't find out how to show the part after the dots.
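In case it helps to narrow this down, this is roughly how I would try to verify the per-batch input size from the query progress (a sketch; the sink and checkpoint options are placeholders, not my real ones, and "joined" refers to the joined stream from the sketch above):

# start the query (placeholder sink and checkpoint location)
query = (joined.writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoint")
    .start())

# each entry in recentProgress is a dict with per-batch metrics;
# numInputRows should stay close to the configured maxOffsetsPerTrigger
for progress in query.recentProgress:
    print(progress["numInputRows"],
          [s["numInputRows"] for s in progress["sources"]])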