I am currently setting up a data pipeline in Databricks. The situation is as follows:

Incoming data arrives as JSON files, which are fetched asynchronously into the file store. When data is received multiple times a day, it is appended to the same JSON files.

The pipeline is triggered once a day. As far as I understand it, if the pipeline runs before all of a day's data has been collected, the file is already marked as processed and will not be re-evaluated, even though new data arrived after the pipeline executed. As a result, the Delta tables are missing this data.

Is there any way to fix this behavior?

bluhub

1 Answer

The cloudFiles.allowOverwrites option may help you. Per the documentation:

Whether to allow input directory file changes to overwrite existing data. Available in Databricks Runtime 7.6 and above.

But then you will need to handle duplicates inside your data processing pipeline.
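For illustration, here is a minimal sketch of what that could look like: an Auto Loader stream with allowOverwrites enabled, deduplicating via a Delta MERGE inside foreachBatch. The paths, table name, and event_id key column are placeholder assumptions, not taken from the question, and the target table is assumed to already exist:

```python
# Sketch: re-read changed files with allowOverwrites, then dedupe on write.
from delta.tables import DeltaTable

def upsert_batch(batch_df, batch_id):
    # Files re-read after an append contain rows we already ingested,
    # so drop duplicates within the batch and MERGE on the key column
    # instead of blindly appending to the target table.
    deduped = batch_df.dropDuplicates(["event_id"])  # assumed key column
    target = DeltaTable.forName(spark, "bronze.events")  # assumed existing table
    (target.alias("t")
           .merge(deduped.alias("s"), "t.event_id = s.event_id")
           .whenNotMatchedInsertAll()
           .execute())

(spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")
      .option("cloudFiles.allowOverwrites", "true")   # re-process files whose content changed
      .option("cloudFiles.schemaLocation", "/mnt/checkpoints/events/schema/")
      .load("/mnt/landing/events/")                   # assumed input directory
      .writeStream
      .foreachBatch(upsert_batch)
      .option("checkpointLocation", "/mnt/checkpoints/events/")
      .trigger(availableNow=True)                     # process what's there, then stop
      .start())
```

The availableNow trigger fits a once-a-day triggered job: each run processes everything currently available and then stops, and with allowOverwrites the files that were appended to since the last run are picked up again, with the MERGE preventing duplicate rows in the Delta table.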

Alex Ott