
I would like to ask for your help.

I have been working with Databricks. We developed some scripts that run as streaming jobs. Let's suppose we have two jobs running and writing data to one general local dataset (LDS); that is, notebook1 and notebook2 both write to the same LDS.

Each notebook reads data from a different origin and writes it to the same LDS in a standard format. To avoid conflicts we partitioned the LDS.

This means that in this case the LDS has one partition for notebook1 and another partition for notebook2.
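To illustrate the layout, here is a rough sketch of what each notebook's write side looks like. It is simplified: the partition column name source, the streaming DataFrame df_notebook1, and the checkpoint path are placeholders, not our exact code; only the LDS path is real.

from pyspark.sql.functions import lit

# Sketch only: notebook1 tags its rows and writes them to its own partition of the shared LDS
(df_notebook1                                     # streaming DataFrame read by notebook1
  .withColumn("source", lit("notebook1"))         # placeholder partition value identifying the notebook
  .writeStream
  .format("delta")
  .partitionBy("source")                          # one partition per notebook inside the same table
  .option("checkpointLocation", "/mnt/checkpoints/notebook1")  # each notebook keeps its own checkpoint
  .start("/mnt2/silverdata/LDS"))                 # the shared LDS path

notebook2 does the same with its own partition value and its own checkpoint location.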

This implementation has been working well for almost 5 months.

However, today we just faced the following error:

com.databricks.sql.transaction.tahoe.DeltaFileNotFoundException: No file found in the directory: dbfs:/mnt/streaming/streaming1/_delta_log.

I have been looking for a way to solve it, and the solutions I found are:

  1. Solution 1, which explains some reasons why this situation can happen and suggests either using a new checkpoint directory or setting the Spark property spark.sql.files.ignoreMissingFiles to true in the cluster's Spark config (see the first sketch after this list). Using a new checkpoint directory is not possible for us because of the requirements we need to satisfy: a new checkpoint would mean reprocessing all the data that has already been processed. Why? In summary, we get updates from a database that are saved in a Delta table containing the raw data, and that table is where we consume from, so using a new checkpoint or deleting the current one would mean consuming the whole data set again. That only leaves the option of applying the spark.sql.files.ignoreMissingFiles property. However, my question here is: if we set this property, would we be processing the data from the beginning, or would it resume from where the last checkpoint was?

  2. Solution 2, a similar case I found; however, I didn't fully understand it. What they suggest is to change the parent directory and also pass the directory in the start() option (see the second sketch below), but we already have something similar to that and it does not solve our problem.
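For reference, here are the two suggestions as I understand them, in code. First, the Spark property from Solution 1, which can be set on the session or as a cluster-level Spark config entry (my open question above is whether this reprocesses from the beginning or resumes):

# From Solution 1: ignore files that the log references but that are missing on storage
spark.conf.set("spark.sql.files.ignoreMissingFiles", "true")

# or as an entry in the cluster's Spark config:
# spark.sql.files.ignoreMissingFiles true

Second, what I understood from Solution 2: passing the output directory directly to start() instead of relying only on options. Here df and outputPath are placeholders; outputPath would be our LDS path.

# From Solution 2 (as I understood it): give the target directory to start()
df.writeStream \
  .format("delta") \
  .option("checkpointLocation", checkpointLocation) \
  .start(outputPath)  # e.g. "/mnt2/silverdata/LDS"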

Our main streaming query looks like this:

from pyspark.sql.functions import expr

spark.readStream.format("delta") \
  .option("readChangeFeed", "true") \
  .option("maxFilesPerTrigger", 250) \
  .option("maxBytesPerTrigger", 536870912) \
  .option("failOnDataLoss", "true") \
  .load(DATA_PATH) \
  .filter(expr("_change_type not in ('delete', 'update_preimage')")) \
  .writeStream \
  .queryName(streamQueryName) \
  .foreachBatch(MainFunctionstoprocess) \
  .option("checkpointLocation", checkpointLocation) \
  .option("mergeSchema", "true") \
  .trigger(processingTime='1 seconds') \
  .start()

Does anyone have an idea how we could solve this problem without deleting the checkpoints, so we can resume from the last checkpoint where it failed, or some way to go back to an earlier checkpoint so we only have to reprocess part of the data?

  • Is `streaming1` the place you read data from or the place you write data to? – Alex Ott Feb 18 '23 at 10:02
  • Both streaming directories are places we read data from. For example, streaming1 reads from the path `/mnt/streaming/streaming1/` and streaming2 reads from `/mnt/streaming/streaming2/`. Finally, both of them write to `/mnt2/silverdata/LDS` (Local Data Set). – Alex Feb 18 '23 at 13:49
  • We have the same behaviour for some of our jobs; would an OPTIMIZE during a streaming job "corrupt" the table/stream? – Arthur Clerc-Gherardi Jul 07 '23 at 13:25

0 Answers