I have built the following pipeline: Task manager -> SQS -> scraper worker (my app) -> AWS Firehose -> S3 files -> Spark -> (?) Redshift.
Here are some things I am trying to solve/improve, and I would appreciate guidance on them:
- The scraper could potentially fetch duplicated data and flush it to Firehose again, which will result in duplicates in Spark. Should I solve this in Spark with a distinct() on the RDD before starting my calculations (see the first sketch after this list)?
- I am not deleting the processed S3 files, so the data keeps growing and growing. Is this good practice (using S3 as the input database), or should I process each batch of files and delete them after Spark has finished (see the second sketch after this list)? Currently I am doing
sc.textFile("s3n://...../*/*/*")
which will collect ALL the files in my bucket and run the calculations over everything.
- To place the results in Redshift (or S3): how can I do this incrementally? That is, if S3 just keeps getting bigger and bigger, Redshift will end up with duplicated data. Should I always flush the table before loading (see the last sketch below)? How?
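
To make the first point concrete, this is roughly the deduplication I have in mind. It is just a sketch: the bucket name and the CSV key extraction are made up, my real records look different.

    from pyspark import SparkContext

    sc = SparkContext(appName="scraper-dedup")

    # Read every object Firehose has written so far (same pattern as above,
    # "my-bucket" is a placeholder).
    raw = sc.textFile("s3n://my-bucket/*/*/*")

    # Drop exact duplicate records before any aggregation, so an item that
    # was flushed to Firehose twice is only counted once.
    deduped = raw.distinct()

    # The calculations would then run on `deduped` instead of `raw`, e.g.:
    counts = deduped.map(lambda line: (line.split(",")[0], 1)) \
                    .reduceByKey(lambda a, b: a + b)

One thing I am unsure about: distinct() only removes byte-identical lines, so if two flushes of the same item differ in, say, a scrape timestamp, I would have to key on a record id and reduce per key instead.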
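For the second point, the "process each batch and delete it afterwards" alternative I am considering would look something like this (bucket, prefix, and the surrounding job logic are all made up):

    import boto3
    from pyspark import SparkContext

    sc = SparkContext(appName="scraper-batch")
    s3 = boto3.client("s3")

    BUCKET = "my-bucket"           # placeholder
    PREFIX = "firehose-output/"    # placeholder

    # Snapshot the keys that exist right now, so the delete below only removes
    # files this run has actually read (Firehose may keep writing new ones).
    keys = [obj["Key"]
            for page in s3.get_paginator("list_objects_v2")
                          .paginate(Bucket=BUCKET, Prefix=PREFIX)
            for obj in page.get("Contents", [])]

    if keys:
        # Spark's textFile accepts a comma-separated list of paths.
        rdd = sc.textFile(",".join("s3n://{}/{}".format(BUCKET, k) for k in keys)) \
                .distinct()

        # ... run the calculations and write the results (to S3 / Redshift) here ...

        # Only after the job has finished successfully, delete the processed
        # files so the next run starts from a much smaller input.
        for i in range(0, len(keys), 1000):  # delete_objects caps at 1000 keys per call
            s3.delete_objects(Bucket=BUCKET,
                              Delete={"Objects": [{"Key": k} for k in keys[i:i + 1000]]})

The list-then-delete split is there so that files written by Firehose while the job is running are not deleted without ever being processed.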
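For the last point, what I mean by "flush it before" would be something like the brute-force truncate-and-reload sketched below (connection details, table name, S3 path, and IAM role are all placeholders). I am not sure this is the right approach, which is why I am asking whether there is a proper incremental way instead:

    import psycopg2

    # All connection details and names below are placeholders.
    conn = psycopg2.connect(host="my-cluster.xxxxx.us-east-1.redshift.amazonaws.com",
                            port=5439, dbname="analytics",
                            user="master", password="***")
    conn.autocommit = True
    cur = conn.cursor()

    # "Flush" the table first, so re-running the job does not duplicate rows.
    cur.execute("TRUNCATE TABLE scrape_results;")

    # Reload everything from the result files Spark wrote to S3.
    cur.execute("""
        COPY scrape_results
        FROM 's3://my-bucket/spark-output/'
        IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
        FORMAT AS CSV;
    """)

    cur.close()
    conn.close()

This obviously gets more expensive as the data grows, which is exactly why I am asking about an incremental approach.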