In my organization we have an application that receives events and stores them on S3, partitioned by day. Some of the events are offline, which means that while writing we append the files to the proper folder (according to the date of the offline event).
We get the events by reading folder paths from a queue (SQS) and then reading the data from those folders. Each folder can contain data from several different event dates.
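For context, this is roughly what the write path looks like today. It's only a sketch: the queue URL, bucket, Parquet format and the `event_date` partition column are placeholders, not our real names.

```scala
import com.amazonaws.services.sqs.AmazonSQSClientBuilder
import org.apache.spark.sql.{SaveMode, SparkSession}

import scala.collection.JavaConverters._

object IngestEvents {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ingest-events").getOrCreate()
    val sqs   = AmazonSQSClientBuilder.defaultClient()

    // Placeholder queue URL and output bucket, standing in for the real configuration.
    val inputQueueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/incoming-folders"
    val outputPath    = "s3://events-bucket/output"

    // Each SQS message body is assumed to hold one folder path to read.
    val folders = sqs.receiveMessage(inputQueueUrl).getMessages.asScala.map(_.getBody)

    if (folders.nonEmpty) {
      spark.read.parquet(folders: _*)
        .write
        .mode(SaveMode.Append)        // offline events append into existing day folders
        .partitionBy("event_date")    // assumed partition column
        .parquet(outputPath)
    }

    spark.stop()
  }
}
```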
The problem is that if the application fails for some reason after one of the stages has completed, I have no idea what was already written to the output folders, and I can't just delete everything because other data is already there.
Our current solution is to write to HDFS, and after the application finishes a script copies the files to S3 (using s3-dist-cp). But that doesn't seem very elegant.
My current approach is to write my own FileOutputCommitter that adds an applicationId prefix to all written files, so that in case of an error I know what to delete.
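Roughly what I have in mind, as a sketch only: it renames `part-*` files after the normal commit, reads the prefix from an assumed `my.app.id.prefix` configuration key, and for simplicity only looks at the top-level output folder; a real version would also have to walk the day-partition subdirectories.

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.{JobContext, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter

// Hypothetical committer: after the standard commit, rename every part file so its
// name starts with a prefix (e.g. the Spark applicationId) taken from the Hadoop conf.
class PrefixingFileOutputCommitter(outputPath: Path, context: TaskAttemptContext)
    extends FileOutputCommitter(outputPath, context) {

  override def commitJob(jobContext: JobContext): Unit = {
    super.commitJob(jobContext)

    val conf   = jobContext.getConfiguration
    val prefix = conf.get("my.app.id.prefix", "unknown-app") // assumed custom config key
    val fs     = outputPath.getFileSystem(conf)

    fs.listStatus(outputPath)
      .filter(status => status.isFile && status.getPath.getName.startsWith("part-"))
      .foreach { status =>
        val renamed = new Path(status.getPath.getParent, s"$prefix-${status.getPath.getName}")
        fs.rename(status.getPath, renamed)
      }
  }
}
```

Cleanup after a failed run would then just mean deleting every file whose name starts with that run's applicationId.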
So what I'm actually asking is: is there an existing solution for this within Spark, and if not, what do you think of my approach?
--edit--
After chatting with @Yuval Itzchakov I decided to have the application write to a temporary folder and add that path to an AWS SQS queue. An independent process is triggered every x minutes, reads the folder paths from SQS and copies them with s3-dist-cp from the temporary folder to the final output. In the application I wrapped the main method with a try-catch; if I catch an exception I delete the temp folder.
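For anyone interested, this is roughly what both sides look like. Again just a sketch: the bucket, queue URL, Parquet format and `event_date` partition column are placeholders, and for brevity the input folder comes from `args(0)` instead of the input queue. First the application side, which writes to a temp path named after the applicationId, enqueues that path only on success, and deletes it on failure:

```scala
import java.net.URI

import com.amazonaws.services.sqs.AmazonSQSClientBuilder
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.{SaveMode, SparkSession}

object EventWriterJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("event-writer").getOrCreate()
    val sqs   = AmazonSQSClientBuilder.defaultClient()

    // Placeholder locations: the temp root and queue URL are assumptions.
    val tempPath     = s"s3://events-bucket/tmp/${spark.sparkContext.applicationId}"
    val copyQueueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/pending-copies"

    try {
      spark.read.parquet(args(0))     // input folder, normally taken from the input queue
        .write
        .mode(SaveMode.Append)
        .partitionBy("event_date")
        .parquet(tempPath)

      // Only after a fully successful write is the temp path handed off for copying.
      sqs.sendMessage(copyQueueUrl, tempPath)
    } catch {
      case e: Exception =>
        // On failure the temp folder contains only this run's output, so it is safe to drop.
        val fs = FileSystem.get(new URI(tempPath), spark.sparkContext.hadoopConfiguration)
        fs.delete(new Path(tempPath), true)
        throw e
    } finally {
      spark.stop()
    }
  }
}
```

And the independent copy process that runs every x minutes, drains the queue and shells out to s3-dist-cp, deleting each message only after its copy succeeded:

```scala
import com.amazonaws.services.sqs.AmazonSQSClientBuilder

import scala.collection.JavaConverters._

object CopyPendingFolders {
  def main(args: Array[String]): Unit = {
    val sqs          = AmazonSQSClientBuilder.defaultClient()
    val copyQueueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/pending-copies"
    val finalOutput  = "s3://events-bucket/output"   // assumed final destination

    sqs.receiveMessage(copyQueueUrl).getMessages.asScala.foreach { msg =>
      val tempPath = msg.getBody
      // s3-dist-cp merges the temp folder's day partitions into the final output.
      val exitCode = new ProcessBuilder(
        "s3-dist-cp", "--src", tempPath, "--dest", finalOutput
      ).inheritIO().start().waitFor()

      // Remove the message only once the copy succeeded, so failed copies are retried.
      if (exitCode == 0) sqs.deleteMessage(copyQueueUrl, msg.getReceiptHandle)
    }
  }
}
```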