I am currently setting up a data pipeline in Databricks. The situation is as follows:
Incoming data arrives as JSON files, which are fetched asynchronously into the filestore. When data is received multiple times a day, the new records are appended to the same JSON file rather than written to a new one.
The pipeline is triggered once a day. As far as I understand, if the pipeline runs before all of a day's data has been collected, the file is already marked as processed and will not be re-evaluated, even though new data arrives after the run. As a result, the Delta tables are missing that data.
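For reference, here is a minimal sketch of this kind of daily ingestion, assuming it is built on Auto Loader (the `cloudFiles` source); all paths and the table name below are illustrative placeholders, not my actual setup:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Auto Loader discovers new files in the landing directory and records
# each ingested file in the stream's checkpoint.
df = (
    spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "/mnt/checkpoints/incoming_schema")  # placeholder
    .load("/mnt/filestore/incoming/")  # placeholder landing path
)

# Triggered once per run by the daily job; availableNow processes
# everything discovered so far and then stops.
(
    df.writeStream
    .option("checkpointLocation", "/mnt/checkpoints/incoming")  # placeholder
    .trigger(availableNow=True)
    .toTable("bronze.incoming")  # placeholder Delta table
)
```

With this pattern, the checkpoint tracks files it has already seen, so a file that grows after it was ingested is not picked up again on the next run, which matches the behavior I am describing.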
Is there any way to fix this behavior?