Today four of our streaming jobs started failing with: StreamingQueryException: [STREAM_FAILED] Query [id = ####, runId = ####] terminated with exception: dbfs:/mnt/path/my_table/sources/0/0 doesn't exist (latestId: 8, compactInterval: 10).
- These streams have been running for over a year.
- The only change we made was in March, when we added one more column to the schema.
- The streams read from S3, write Parquet data, and run daily.
- To keep track of files processed, we have a checkpoint in S3.
Root cause: as far as I understand, Spark Structured Streaming for some reason lost the last state of the checkpoint and stopped writing to it.
Has anyone experienced something like this? How did you manage to recover without processing all the files again?
Thanks in advance!
What we found:
- When I go to the path sources/0, the file 0 does not exist.
- We found file 711, which was created on May 23.
- For some reason, on May 24 the stream failed to load the latest batchId state and restarted the batchId at 0.
- It also stopped writing files to the sources, offsets, and commits folders of the checkpoint location.
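
To make sense of the "(latestId: 8, compactInterval: 10)" part of the error, here is a minimal sketch of which files in sources/0 I believe Spark needs in order to rebuild the file source log. It assumes the log compacts whenever (batchId + 1) is a multiple of compactInterval and that compacted files are named like "709.compact"; that is my reading of the behaviour, not something confirmed here. The function names are hypothetical helpers, not Spark APIs.

```python
import re
from typing import Iterable, List

# Matches log file names such as "711" or "709.compact" (assumed naming).
_BATCH_FILE = re.compile(r"^(\d+)(\.compact)?$")


def required_batches(latest_id: int, compact_interval: int) -> List[int]:
    """Batch ids needed to rebuild the log up to latest_id.

    Assumption: batch b is a compaction batch when (b + 1) % compact_interval == 0,
    so only the most recent compaction batch and everything after it are needed.
    """
    start = max(0, (latest_id + 1) // compact_interval * compact_interval - 1)
    return list(range(start, latest_id + 1))


def missing_batches(file_names: Iterable[str], latest_id: int,
                    compact_interval: int) -> List[int]:
    """Required batch ids that have no file in the given directory listing."""
    present = set()
    for name in file_names:
        match = _BATCH_FILE.match(name)
        if match:
            present.add(int(match.group(1)))
    return [b for b in required_batches(latest_id, compact_interval)
            if b not in present]


if __name__ == "__main__":
    # Listing that only contains the old file 711, with the values from the error.
    print(missing_batches(["711"], latest_id=8, compact_interval=10))
```

Under these assumptions, with latestId 8 and compactInterval 10 no compaction has happened yet after the restart, so every file from 0 to 8 is required. That would be consistent with the error complaining about sources/0/0. Feeding the function the real listing of sources/0 (for example from dbutils.fs.ls) should show which files are actually missing.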