8

I am using the file source in Spark Structured Streaming and want to delete the files after I process them.

I am reading a directory of JSON files (1.json, 2.json, etc.) and writing them out as Parquet files. I want to remove each file after it has been processed successfully.

zero323
saul.shanabrook

3 Answers

3

EDIT 2: Changed my Go script to read sources instead. new script

EDIT: Trying this out currently, and it might be deleting files before they are processed. Currently looking for a better solution and investigating this method.

I solved this temporarily by creating a Go script. It will scan the checkpoints folder that I set in Spark and process the files in that to figure out which files have been written out of Spark already. It will then delete them if they exist. It does this every 10 seconds.

However, this relies on Spark's checkpoint file structure and representation (JSON), which is not documented and could change at any point. I also have not looked through the Spark source code to see whether the files I am reading (checkpoint/sources/0/...) are the real source of truth for processed files. It seems to be working at the moment, though, and it is better than doing it manually.
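
For illustration, the scanning step could look roughly like the Python sketch below (not the Go script mentioned above). It assumes each log file under checkpoint/sources/0/ holds a version header line followed by one JSON object per line with a `path` field. That layout is an assumption about an undocumented format, and `processed_paths` is a hypothetical helper name. The sketch only lists the recorded paths and deletes nothing, since a recorded path may not yet be fully written to the sink.

```python
import json
from pathlib import Path
from typing import Set


def processed_paths(checkpoint_dir: str, source_id: int = 0) -> Set[str]:
    """Collect the 'path' values recorded under <checkpoint>/sources/<source_id>/.

    Assumption: each log file has a version header line followed by one
    JSON object per line containing a "path" key. This is not documented.
    """
    log_dir = Path(checkpoint_dir) / "sources" / str(source_id)
    seen: Set[str] = set()
    if not log_dir.is_dir():
        return seen
    for log_file in sorted(log_dir.iterdir()):
        # Skip directories and hidden files (temporary or checksum files).
        if not log_file.is_file() or log_file.name.startswith("."):
            continue
        for line in log_file.read_text().splitlines():
            line = line.strip()
            if not line.startswith("{"):
                continue  # header line or anything that is not a JSON object
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # ignore lines that do not parse
            path = entry.get("path")
            if path:
                seen.add(path)
    return seen
```

A caller would compare the returned paths against the input directory and decide separately whether removing a file is safe.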

saul.shanabrook
2

This is now possible in Spark 3. You can use the "cleanSource" option on readStream.

Thanks to the documentation at https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html and this video: https://www.youtube.com/watch?v=EM7T34Uu2Gg.

After searching for many hours, I finally found the solution.

Mr AK
2

The documentation points to usage of cleanSource.

cleanSource: option to clean up completed files after processing.
Available options are "archive", "delete", "off". If the option is not provided, the default value is "off".

Refer: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#input-sources
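
As a rough PySpark sketch of how this option might be applied to the JSON-to-Parquet job in the question: the helper names `file_source_options` and `start_json_to_parquet`, and all directory arguments, are hypothetical. It assumes Spark 3.x, a schema supplied by the caller, and that "archive" is paired with the `sourceArchiveDir` option, as described in the same section of the guide.

```python
from typing import Dict, Optional

# Values listed in the documentation for the cleanSource option.
VALID_CLEAN_SOURCE = ("archive", "delete", "off")


def file_source_options(clean_source: str = "off",
                        archive_dir: Optional[str] = None) -> Dict[str, str]:
    """Build the option map for a file-source readStream (hypothetical helper)."""
    if clean_source not in VALID_CLEAN_SOURCE:
        raise ValueError(f"cleanSource must be one of {VALID_CLEAN_SOURCE}")
    options = {"cleanSource": clean_source}
    if clean_source == "archive":
        # Assumption: archiving needs a target directory via sourceArchiveDir.
        if not archive_dir:
            raise ValueError("archive_dir is required when clean_source is 'archive'")
        options["sourceArchiveDir"] = archive_dir
    return options


def start_json_to_parquet(spark, schema, input_dir, output_dir,
                          checkpoint_dir, options):
    """Stream JSON files from input_dir into Parquet files under output_dir.

    `spark` is an existing SparkSession and `schema` describes the JSON records.
    """
    return (
        spark.readStream
        .schema(schema)
        .options(**options)
        .json(input_dir)
        .writeStream
        .format("parquet")
        .option("checkpointLocation", checkpoint_dir)
        .start(output_dir)
    )
```

For example, `start_json_to_parquet(spark, schema, "in/", "out/", "chk/", file_source_options("delete"))` would ask Spark to remove each source file once it has been processed. It is worth checking the guide's notes on this option before relying on when the cleanup actually happens.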