
In my organization we have an application that receives events and stores them on S3, partitioned by day. Some of the events arrive offline, which means that while writing we append the files to the proper folder (according to the date of the offline event).

We get the events by reading folder paths from a queue (SQS) and then reading the data from those folders. Each folder will contain data from several different event dates.

The problem is that if the application fails for some reason after one of the stages has completed, I have no idea what was already written to the output folder, and I can't delete it all because other data is already there.

Our current solution is to write to HDFS, and after the application finishes, a script copies the files to S3 (using s3-dist-cp). That doesn't seem very elegant, though.

My current approach is to write my own FileOutputCommitter that adds an applicationId prefix to all written files, so that in case of error I know what to delete.
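
Roughly what I have in mind (an untested sketch; the `my.app.id` configuration key is just a placeholder I would set from the driver via `spark.sparkContext.applicationId`, and only top-level files in the task work path are handled here):

```scala
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.TaskAttemptContext
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter

// Sketch: rename each task's output files to carry the application ID as a prefix,
// so files left behind by a failed run can be identified and deleted later.
class PrefixingFileOutputCommitter(outputPath: Path, context: TaskAttemptContext)
    extends FileOutputCommitter(outputPath, context) {

  // "my.app.id" is a placeholder key, set on the Hadoop configuration from the driver.
  private val appId = context.getConfiguration.get("my.app.id", "unknown-app")

  override def commitTask(taskContext: TaskAttemptContext): Unit = {
    val fs = getWorkPath.getFileSystem(taskContext.getConfiguration)
    // Prefix every file this task attempt produced (subdirectories are not handled here).
    fs.listStatus(getWorkPath).filter(_.isFile).foreach { status =>
      val src = status.getPath
      fs.rename(src, new Path(src.getParent, s"$appId-${src.getName}"))
    }
    // Let the default committer move the renamed files into place.
    super.commitTask(taskContext)
  }
}
```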

So what I'm actually asking is: is there an existing solution for this within Spark, and if not, what do you think of my approach?

--edit--

After chatting with @Yuval Itzchakov I decided to have the application write to a temporary path and add this path to an AWS SQS queue. An independent process is triggered every x minutes, reads folders from SQS and copies them with s3-dist-cp from the temporary path to the final destination. In the application I wrapped the main method with try-catch; if I catch an exception I delete the temp folder.
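
A simplified sketch of the new driver flow (the bucket name, queue URL and the `writeEvents` helper are placeholders, and this assumes the AWS SDK for Java v1 is on the classpath):

```scala
import com.amazonaws.services.sqs.AmazonSQSClientBuilder
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

// Sketch: write to a temp prefix, enqueue it for the copier process,
// and delete the temp prefix if anything fails.
object EventsWriterJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("events-writer").getOrCreate()

    // Placeholders -- the real bucket, prefix and queue URL come from configuration.
    val tempPath = s"s3://my-bucket/tmp/${spark.sparkContext.applicationId}"
    val queueUrl = "https://sqs.us-east-1.amazonaws.com/123456789012/folders-to-copy"
    val sqs      = AmazonSQSClientBuilder.defaultClient()

    try {
      writeEvents(spark, tempPath)         // the actual read-and-write logic
      sqs.sendMessage(queueUrl, tempPath)  // tell the copier process which folder to move
    } catch {
      case e: Exception =>
        // On failure, drop the partial output so nothing half-written reaches the final location.
        val path = new Path(tempPath)
        path.getFileSystem(spark.sparkContext.hadoopConfiguration).delete(path, true)
        throw e
    } finally {
      spark.stop()
    }
  }

  // Placeholder for the existing processing logic.
  private def writeEvents(spark: SparkSession, outputPath: String): Unit = ()
}
```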

Tal Joffe
  • *if the application failed* By application failed, you mean the entire job terminated? Or that particular stage failed? – Yuval Itzchakov Aug 24 '16 at 13:39
  • I mean the job terminated. A single task is not a problem because it is handled by commitTask – Tal Joffe Aug 24 '16 at 13:50
  • Why not write everything to a temporary directory in S3, and at the end of each batch move it to the proper folder? That way, you can isolate the problem and delete the entire temp content if you want. – Yuval Itzchakov Aug 24 '16 at 13:53
  • @YuvalItzchakov I'm currently actually trying to decide between what you said and the approach in the question. Your suggestion will definitely solve the problem, but it will force us to store some of the data twice and have a separate process that deletes files. Have you done something like this in the past? – Tal Joffe Aug 24 '16 at 13:59
  • Not sure I understand why it forces you to duplicate data? You write to a temporary key, and copy the objects from the temporary location once the batch completes, and delete the temporary keys. In case the application crashed, you can either decide to delete the previous keys and re-write them, or continue from the last keys ID and forward. – Yuval Itzchakov Aug 24 '16 at 14:01
  • O.k., right, I was thinking about doing it in parallel and that's the reason for the duplication. If I copy from within the application I think it will be less efficient than using s3-dist-cp, and it will also block the next cycle – Tal Joffe Aug 24 '16 at 14:06
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/121751/discussion-between-yuval-itzchakov-and-tal-joffe). – Yuval Itzchakov Aug 24 '16 at 14:07

0 Answers