
I am trying to build a solution on top of S3. A lot of files are dumped to S3 every hour. I need to process these files in Spark and write the results back to S3. What is the best approach? One approach I have thought of is:

1) Whenever a file is written to S3, generate an event in SQS.
2) Spark, running in batch mode, reads the SQS events, processes all the referenced files in S3 at that time, and writes the results back to S3.
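A minimal sketch of that flow, assuming boto3 and PySpark are available and that the queue receives standard S3 event notifications; the queue URL, output path, and the transformation itself are placeholders:

```python
import json
from urllib.parse import unquote_plus

import boto3
from pyspark.sql import SparkSession

QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/s3-events"  # placeholder
OUTPUT_PATH = "s3a://my-output-bucket/processed/"                         # placeholder

spark = SparkSession.builder.appName("sqs-s3-batch").getOrCreate()
sqs = boto3.client("sqs")

# Drain the currently visible messages and collect the S3 paths they refer to.
messages, paths = [], []
while True:
    resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=10, WaitTimeSeconds=2)
    batch = resp.get("Messages", [])
    if not batch:
        break
    messages.extend(batch)
    for msg in batch:
        for record in json.loads(msg["Body"]).get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            key = unquote_plus(record["s3"]["object"]["key"])
            paths.append(f"s3a://{bucket}/{key}")

# Process every file referenced by this batch of events in one Spark job.
if paths:
    df = spark.read.json(paths)   # adjust the reader to your file format
    result = df                   # placeholder for the real transformation
```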

The issue I see here: 1) What happens if, after processing the messages in Spark and writing to S3 but before deleting the messages from SQS, my Spark job goes down? Is it possible to make the SQS deletion and the S3 write an atomic operation in Spark?
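The ordering I have in mind is: write first, delete the messages only afterwards, so that a crash in between means the messages reappear after the visibility timeout and the batch is reprocessed rather than lost, provided the write is idempotent (e.g. overwrites a deterministic output prefix per batch). Continuing the sketch above (the `batch_id` value is hypothetical, derived from the hour being processed):

```python
# Not atomic: if the job dies between the write and the deletes, the messages
# become visible again after the SQS visibility timeout and the whole batch is
# reprocessed, so the write targets a deterministic, overwritable prefix.
batch_id = "2017-01-01-13"  # hypothetical: derive from the hour being processed
if paths:
    result.write.mode("overwrite").parquet(f"{OUTPUT_PATH}batch_id={batch_id}/")

    # Delete only after the write has fully succeeded (at-least-once semantics).
    for msg in messages:
        sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```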

Nipun

1 Answer


Try using AWS Data Pipeline to automate this task.

You can configure it to launch an EMR Spark cluster every hour (given that the files appear on S3 hourly), process the data, and store the results back to S3.

Your cluster can be terminated when the job is finished.

https://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-console-templates.html
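The linked templates do this declaratively; just to illustrate the transient-cluster behaviour, roughly the same thing with boto3 looks like the sketch below (cluster name, instance types, and the job script path are placeholders):

```python
import boto3

emr = boto3.client("emr")

# Launch a transient EMR cluster that runs one Spark step and then terminates,
# roughly what the scheduled Data Pipeline / EMR template does.
emr.run_job_flow(
    Name="hourly-s3-processing",
    ReleaseLabel="emr-5.6.0",
    Applications=[{"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m4.large",
        "SlaveInstanceType": "m4.large",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate once the step finishes
    },
    Steps=[{
        "Name": "process-hourly-files",
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": ["spark-submit", "--deploy-mode", "cluster",
                     "s3://my-bucket/jobs/process_hourly.py"],
        },
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
```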

Atish