Spark Streaming: Checkpoint corrupted

Asked Jun 02 '23 at 16:42

Active Jun 03 '23 at 14:23

Viewed 38 times

Today four of our streaming jobs started failing with: StreamingQueryException: [STREAM_FAILED] Query [id = ####, runId = ####] terminated with exception: dbfs:/mnt/path/my_table/sources/0/0 doesn't exist (latestId: 8, compactInterval: 10).

These streams have been running for over a year.
The only change we made was in March, when we added one more column to the schema.
The streams read from S3, write Parquet data, and run daily.
To keep track of the files already processed, we keep a checkpoint in S3.

Root cause: as I understand it, for some reason Spark Streaming lost the last state of the checkpoint and stopped writing to it.
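
A minimal sketch of how the numbers in the error message (latestId: 8, compactInterval: 10) can be read. It assumes the file source log follows the compaction scheme of Spark's CompactibleFileStreamLog as I understand it: a batch is a compaction batch when (batchId + 1) is a multiple of the interval, and reading the log needs every batch from the most recent compaction batch up to the latest one. The function names are my own and are not Spark API.

```python
from typing import List


def is_compaction_batch(batch_id: int, compact_interval: int) -> bool:
    # Assumption: with interval 10, batches 9, 19, 29, ... are compaction batches.
    return (batch_id + 1) % compact_interval == 0


def batches_needed(latest_id: int, compact_interval: int) -> List[int]:
    # Start at the most recent compaction batch at or before latest_id,
    # or at batch 0 when no compaction has happened yet.
    start = max(0, (latest_id + 1) // compact_interval * compact_interval - 1)
    return list(range(start, latest_id + 1))


def log_file_name(batch_id: int, compact_interval: int) -> str:
    # Compaction batches are stored with a ".compact" suffix.
    if is_compaction_batch(batch_id, compact_interval):
        return f"{batch_id}.compact"
    return str(batch_id)


# The case from the error message: no compaction batch exists yet,
# so every file from 0 to 8 is required, including the missing 0.
print(batches_needed(8, 10))  # [0, 1, 2, 3, 4, 5, 6, 7, 8]
```

Under the same assumption, a log whose latest batch is 711 would only need 709.compact, 710 and 711, which would explain why a missing file 0 was not a problem before the batchId went back to 0.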

Has anyone experienced something like this? How did you manage to recover without processing all the files again?

Thanks in advance!

What we found:

When I go to the path sources/0, the file 0 does not exist.
We do find the file 711, which was created on the 23rd of May.
For some reason, on the 24th of May the stream failed to get the latest batchId state and restarted the batchId at 0.
It also stopped writing files to the sources, offsets, and commits folders of the checkpoint location.
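
The check described above can be sketched as a small helper that takes the file names listed in one checkpoint folder and reports the highest batch id. It assumes the names were already obtained with whatever listing tool is available (on Databricks, for example, the name field of the entries returned by dbutils.fs.ls). The sample names below are hypothetical apart from 711.

```python
import re
from typing import Iterable, Optional

COMPACT_SUFFIX = ".compact"


def latest_batch_id(names: Iterable[str]) -> Optional[int]:
    """Return the highest batch id among the file names, or None if there is none."""
    ids = []
    for name in names:
        # Compacted log files look like "709.compact"; strip the suffix first.
        stem = name[: -len(COMPACT_SUFFIX)] if name.endswith(COMPACT_SUFFIX) else name
        # Skip anything that is not a plain batch number (crc files, metadata, ...).
        if re.fullmatch(r"[0-9]+", stem):
            ids.append(int(stem))
    return max(ids) if ids else None


# Hypothetical listing of sources/0; only "711" is taken from the findings above.
sample = ["709.compact", "710", "711", ".711.crc", "metadata"]
print(latest_batch_id(sample))  # 711
```

Running the same helper over the offsets and commits folders and comparing the three results shows whether they stopped at the same batch.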


Martín Riccardi


0 Answers