
I built the following pipeline: Task manager -> SQS -> scraper worker (my app) -> AWS Firehose -> S3 files -> Spark -> (?) Redshift.

Some things I am trying to solve/improve and I would be happy for guidance:

  1. The scraper could potentially fetch duplicate data and flush it to Firehose again, which will result in duplicates in Spark. Should I solve this in Spark with the distinct function BEFORE starting my calculations? (A sketch of what I mean follows this list.)
  2. I am not deleting the processed S3 files, so the data keeps growing. Is it good practice to keep S3 as the input database, or should I process each file and delete it once Spark has finished with it? Currently I am doing sc.textFile("s3n://...../*/*/*"), which collects ALL the files in my bucket and runs the calculations over them.
  3. How can I place the results in Redshift (or S3) incrementally? That is, if S3 just keeps getting bigger, Redshift will end up with duplicated data. Should I always flush (truncate) it first? How?
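
For reference, this is roughly what I mean in point 1. It is only a sketch; the bucket path, app name and the assumption that duplicates are byte-for-byte identical lines are placeholders, not my real setup:

import org.apache.spark.{SparkConf, SparkContext}

object DistinctBeforeCalc {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("distinct-before-calc"))

    // Same pattern I use today: pull in every file under the bucket.
    val raw = sc.textFile("s3n://MY_BUCKET/*/*/*")

    // distinct removes lines that are byte-for-byte duplicates, but it
    // shuffles the whole dataset, which is the cost I am unsure about.
    val deduped = raw.distinct()

    // ... my calculations would run on `deduped` from here on ...
    println(s"records after distinct: ${deduped.count()}")

    sc.stop()
  }
}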
  • You can have one bucket for the elements to be processed, and once they have been pushed, move them to another bucket; that way you keep a copy if needed but will not reprocess them a second time. – Frederic Henri Aug 01 '16 at 13:00

1 Answer

I have encountered these issues before, although not in a single pipeline. Here is what I did.

  1. Removing duplicates

    a. I used a BloomFilter to remove local duplicates. Note that the documentation is relatively incomplete, but you can save/load/union/intersect the Bloom filter objects easily. You can even do a reduce over the filters.

    b. If you save data directly from Spark to Redshift, chances are you will need to spend some time and effort updating the BloomFilter with the current batch, broadcasting it, and then filtering, to ensure there are no duplicates globally. Previously I used a UNIQUE constraint in RDS and ignored the error, but unfortunately Redshift does not enforce the constraint. A minimal sketch of the broadcast-and-filter idea follows below.
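
Here is that sketch, assuming Spark 2.x's org.apache.spark.util.sketch.BloomFilter; the bucket paths, the "id" column, the filter file location and the sizing numbers are placeholders you would replace with your own:

import java.io.{File, FileInputStream, FileOutputStream}
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.sketch.BloomFilter

object BloomFilterDedup {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("bloom-dedup").getOrCreate()

    // Current batch of scraped records; path and JSON layout are placeholders.
    val batch = spark.read.json("s3a://MY_BUCKET/current-batch/")

    // Load the filter of ids already loaded into Redshift (saved by a previous
    // run), or start with an empty one on the first run. Sizes are guesses.
    val filterFile = new File("/tmp/seen-ids.bloom")
    val seen: BloomFilter =
      if (filterFile.exists()) BloomFilter.readFrom(new FileInputStream(filterFile))
      else BloomFilter.create(10000000L, 0.01)

    val seenBc = spark.sparkContext.broadcast(seen)

    // A Bloom filter has no false negatives, so every id that was already
    // loaded gets dropped; a small fraction of genuinely new ids may be
    // dropped as well (false positives), which is the trade-off.
    val fresh = batch.filter(row => !seenBc.value.mightContainString(row.getAs[String]("id")))

    // Write the de-duplicated batch wherever your Redshift COPY picks it up.
    fresh.write.parquet("s3a://MY_BUCKET/deduped/current-batch/")

    // Fold this batch's ids into the filter and save it for the next run.
    seen.mergeInPlace(batch.stat.bloomFilter("id", 10000000L, 0.01))
    seen.writeTo(new FileOutputStream(filterFile))

    spark.stop()
  }
}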

  2. and 3. Data getting bigger

I used an EMR cluster to run the s3-dist-cp command to move and merge data (there are usually a lot of small log files, which hurt Spark's performance). If you happen to use EMR to host your Spark cluster, just add a step before your analysis to move data from one bucket to another. The step takes command-runner.jar as the Custom JAR, and the command looks like

s3-dist-cp --src=s3://INPUT_BUCKET/ --dest=s3://OUTPUT_BUCKET_AND_PATH/ --groupBy=".*\.2016-08-(..)T.*" --srcPattern=".*\.2016-08.*" --appendToLastFile --deleteOnSuccess

Note that the original distcp doesn't support merging files.

Generally, you should try to avoid having processed and unprocessed data together in the same bucket (or at least, path).
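
If you are not on EMR, you can get a similar effect from the Spark driver by moving the input files to a "processed" prefix once the job has succeeded. A rough sketch using the Hadoop FileSystem API; the bucket and prefix names are placeholders, and keep in mind that a rename on S3 is really a copy followed by a delete, so it is not free:

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.sql.SparkSession

object MoveProcessedFiles {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("move-processed").getOrCreate()

    val incoming  = new Path("s3a://MY_BUCKET/incoming/")
    val processed = new Path("s3a://MY_BUCKET/processed/")

    // ... run the analysis on everything under `incoming` here ...

    // Only after the job has finished successfully, move the inputs aside so
    // the next run does not pick them up again.
    val fs = FileSystem.get(new URI("s3a://MY_BUCKET"), spark.sparkContext.hadoopConfiguration)
    fs.mkdirs(processed)
    fs.listStatus(incoming).foreach { file =>
      fs.rename(file.getPath, new Path(processed, file.getPath.getName))
    }

    spark.stop()
  }
}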

– shuaiyuancn