
Today, 4 of our streaming jobs started to fail with the following error: StreamingQueryException: [STREAM_FAILED] Query [id = ####, runId = ####] terminated with exception: dbfs:/mnt/path/my_table/sources/0/0 doesn't exist (latestId: 8, compactInterval: 10).

  • These streams have been running for more than a year.
  • The only change we made was in March, when we added one more column to the schema.
  • The streams read from S3, write Parquet data, and run daily.
  • To keep track of the files already processed, we keep a checkpoint in S3.

Root cause: as far as I understand, for some reason Spark streaming lost the latest state of the checkpoint and stopped writing to the checkpoint.
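
To make sense of the latestId and compactInterval values in the error, here is a minimal sketch of how I understand the file source's compacted metadata log to decide which batch files it needs on restart. This is an assumption based on my reading of Spark's CompactibleFileStreamLog, not something I have confirmed for our runtime version, and the function names are my own, not Spark API:

```python
from typing import List


def is_compaction_batch(batch_id: int, compact_interval: int) -> bool:
    # Assumption: every compact_interval-th batch (9, 19, 29, ... for an
    # interval of 10) is written as a "<id>.compact" file that folds in
    # all earlier entries.
    return (batch_id + 1) % compact_interval == 0


def batches_needed(latest_id: int, compact_interval: int) -> List[int]:
    # Assumption: on restart the log reads from the most recent compaction
    # batch (or from batch 0 if none exists yet) up to latest_id.
    start = max(0, (latest_id + 1) // compact_interval * compact_interval - 1)
    return list(range(start, latest_id + 1))


if __name__ == "__main__":
    print(batches_needed(8, 10))    # what the failing query seems to look for
    print(batches_needed(711, 10))  # what I would expect it to look for
```

If this reading is right, a query that believes its latest batch is 8 needs files 0 through 8, which would explain why it complains about sources/0/0, while a query that knew about batch 711 would only need 709.compact, 710 and 711.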

Has anyone experienced something like this? How did you manage to recover without processing all the files again?

Thanks in advance!

What we found:

  • When I go to the path sources/0, the file 0 does not exist.
  • We do find the file 711, which was created on May 23.
  • For some reason, on May 24 the stream failed to get the latest batchId state and restarted the batchId at 0.
  • It also stopped writing files to the sources, offsets, and commits folders of the checkpoint location.
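
To compare the three folders, this is roughly the kind of helper I used to summarize the batch files in each one. It is only a sketch: it assumes batch files are named with a plain number, optionally followed by ".compact", and that hidden files (for example checksum files starting with a dot) can be ignored. The function name is hypothetical:

```python
from typing import Dict, Iterable, Optional


def summarize_log_dir(file_names: Iterable[str]) -> Optional[Dict[str, int]]:
    """Return the lowest and highest batch id and the number of batch files."""
    ids = []
    for name in file_names:
        # Skip hidden files such as checksums or temporary files.
        if name.startswith("."):
            continue
        # Treat "709.compact" as batch 709.
        stem = name[: -len(".compact")] if name.endswith(".compact") else name
        if stem.isdigit():
            ids.append(int(stem))
    if not ids:
        return None
    return {"min": min(ids), "max": max(ids), "count": len(ids)}


if __name__ == "__main__":
    # Made-up listing for illustration, not our real checkpoint contents.
    print(summarize_log_dir(["709.compact", "710", "711", ".711.crc"]))
```

On Databricks the names can be collected with something like [f.name for f in dbutils.fs.ls(folder)], where folder is the sources/0, offsets, or commits path under the checkpoint location. Running it on each folder shows whether they all stop at the same batch id.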
