I had an issue with a Spark Structured Streaming (SSS) application that crashed due to a program bug and did not process anything over the weekend. When I restarted it, there were many messages on the topics to reprocess (about 250'000 messages on each of the 3 topics, which need to be joined).
On restart, the application crashed again with an OutOfMemory exception. I learned from the docs that the maxOffsetsPerTrigger option on the read stream is supposed to help in exactly these cases. I changed the PySpark code (running on SSS 2.4.3, by the way) to something like the following for all 3 topics:
rawstream = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", bootstrapServers)
    .option("subscribe", topicName)
    # limit how many offsets are read per micro-batch
    .option("maxOffsetsPerTrigger", 10000)
    .option("startingOffsets", "earliest")
    .load())
My expectation would be that the SSS query would now load at most ~10'000 offsets from each of the topics and join them in the first batch. In the second batch it would clean up the state records from the first batch which are subject to expiration due to the watermark (which would clean up most of the records from the first batch) and then read the next ~10k from each topic. So after roughly 25 batches it should have worked through the lag with a "reasonable" amount of memory.
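For context, the three streams are joined roughly like this (a simplified sketch; the key and timestamp column names and the 10-minute intervals are placeholders, not my exact code):

from pyspark.sql import functions as F

# sketch of the stream-stream join with watermarks (placeholder names/intervals)
stream_a = rawstream_a.withWatermark("ts_a", "10 minutes")
stream_b = rawstream_b.withWatermark("ts_b", "10 minutes")

joined = stream_a.join(
    stream_b,
    F.expr("""
        key_a = key_b AND
        ts_b >= ts_a AND
        ts_b <= ts_a + interval 10 minutes
    """))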
But the application still kept crashing with an OOM, and when I checked the DAG in the application master UI, it showed that Spark again tried to read all 250'000 messages in a single batch.
Is there something more that I need to configure? How can I check that this option is really being used? When I inspect the plan, it is unfortunately truncated and just shows (Options: [includeTimestamp=true,subscribe=IN2,inferSchema=true,failOnDataLoss=false,kafka.b...); I couldn't find out how to show the part after the dots.
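In case it helps to narrow this down, this is roughly how I would try to verify the per-batch input size from the query progress (a sketch; the sink and checkpoint options are placeholders, not my real ones, and "joined" refers to the joined stream from the sketch above):

# start the query (placeholder sink and checkpoint location)
query = (joined.writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoint")
    .start())

# each entry in recentProgress is a dict with per-batch metrics;
# numInputRows should stay close to the configured maxOffsetsPerTrigger
for progress in query.recentProgress:
    print(progress["numInputRows"],
          [s["numInputRows"] for s in progress["sources"]])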