I have an Azure Data Lake Storage container that acts as a landing area for JSON files to be processed by Apache Spark.
There are tens of thousands of small files there (up to a few MB each). The Spark code reads these files on a regular basis and performs some transformations.
I want each file to be read exactly once and the Spark script to be idempotent. How do I ensure that the files are not read over and over, and how do I do that efficiently?
I read the data this way:
```python
input_df = spark.read.json("/mnt/input_location/*.json")
```
I thought about the following approaches:
- Create a Delta table with the file names that have already been processed and run an EXCEPT (or anti-join) transformation against the input DataFrame (see the sketch after this list).
- Move the processed files to a different location (or rename them). I would rather not do that: if I ever need to reprocess the data, I have to move or rename everything back, and that operation takes a long time.
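
For the first approach, this is roughly what I had in mind (a minimal sketch; the `/mnt/processed_log` path and the `source_file` column are just placeholder names, and I use a left anti join on the file name rather than EXCEPT so that only the file names have to match):

```python
from pyspark.sql import functions as F

processed_log_path = "/mnt/processed_log"  # placeholder path for the log of processed files

# Tag each row with the file it came from
input_df = (
    spark.read.json("/mnt/input_location/*.json")
         .withColumn("source_file", F.input_file_name())
)

# Load the list of files already processed (empty on the very first run)
try:
    processed_files = spark.read.format("delta").load(processed_log_path)
except Exception:  # the log table does not exist yet
    processed_files = spark.createDataFrame([], "source_file string")

# Keep only rows coming from files not seen in previous runs
new_df = input_df.join(processed_files, on="source_file", how="left_anti")

# ... run the transformations on new_df ...

# Record the newly processed file names so the next run skips them
(new_df.select("source_file").distinct()
       .write.format("delta").mode("append").save(processed_log_path))
```

My concern is that this still lists and reads tens of thousands of small files on every run and joins against an ever-growing log table, which does not feel efficient.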
I hope there is a better way. Please suggest something.