
My objective is to read data from files in S3, transform it, and save it to a data store (could be DynamoDB or RDS). The file size would be <20MB, and there could be multiple (~10) such files uploaded periodically (once a day). I'm considering the two approaches below.

  1. AWS Lambda
  2. AWS Batch

Ideally, processing a file should take less than 15 minutes, but there is no guarantee on the file size, so in theory processing could exceed Lambda's capabilities. The approach I thought of is to check beforehand whether the file can be processed within Lambda's limits: if yes, invoke the Lambda function; else, trigger a Batch job (as sketched below). As of now I'm considering DynamoDB, and while in practice each item would be under DynamoDB's 400KB item size limit, there is no guarantee of that. Would my proposed design be any different if I switched the database to RDS?
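
A minimal sketch of that dispatch step, assuming the object's size is a usable proxy for processing time. The threshold, function name, job queue, and job definition below are placeholders, not real resources:

```python
import json
import re
import urllib.parse

import boto3

lambda_client = boto3.client("lambda")
batch = boto3.client("batch")

# All of these are placeholders -- substitute your own resources/threshold.
SIZE_THRESHOLD_BYTES = 20 * 1024 * 1024   # above this, assume >15 min of work
PROCESSOR_FUNCTION = "file-processor"     # hypothetical processing Lambda
JOB_QUEUE = "file-processing-queue"       # hypothetical Batch job queue
JOB_DEFINITION = "file-processing-job"    # hypothetical Batch job definition

def handler(event, context):
    """Dispatcher Lambda, fired by S3 ObjectCreated events: routes each new
    file either to the processing Lambda or to AWS Batch based on its size."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        # S3 event keys are URL-encoded (e.g. spaces arrive as '+').
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        size = record["s3"]["object"]["size"]  # object size in bytes

        if size <= SIZE_THRESHOLD_BYTES:
            # Small file: hand off to the processing Lambda asynchronously.
            lambda_client.invoke(
                FunctionName=PROCESSOR_FUNCTION,
                InvocationType="Event",
                Payload=json.dumps({"bucket": bucket, "key": key}),
            )
        else:
            # Potentially beyond the 15-minute limit: submit a Batch job.
            batch.submit_job(
                # Batch job names only allow letters, digits, '-' and '_'.
                jobName="process-" + re.sub(r"[^A-Za-z0-9_-]", "-", key),
                jobQueue=JOB_QUEUE,
                jobDefinition=JOB_DEFINITION,
                containerOverrides={
                    "environment": [
                        {"name": "BUCKET", "value": bucket},
                        {"name": "KEY", "value": key},
                    ]
                },
            )
```

With this layout, the same container image (or the Lambda's deployment package) can hold the actual transform-and-save logic, and the dispatcher stays a thin routing layer.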

Another question I have is when to consider traditional ETL approaches such as AWS Data Pipeline, EMR, or Glue.

  • How long would it take to process one file? Specifically, would it be under 15 minutes? Can the files be processed individually, or does it need to be done in batches? What is the format of the files, and what type of processing needs to be done? Which target do you actually want to use (DynamoDB or Amazon RDS)? They are quite different types of storage, and the choice should be based on how you intend to _use_ the data once it is in the database. Feel free to edit your question to include these details instead of answering in a comment. – John Rotenstein Oct 30 '19 at 02:34
  • Added more details to the question. – user3035692 Oct 30 '19 at 05:32
  • You can increase the memory of a Lambda function, which also increases the CPU. This might speed up the operation. Another option is having the S3 Event trigger an AWS Lambda function, which then launches an Amazon EC2 instance to process the file. That way, the processing can go longer than 15 minutes. EC2 is charged per-second, so it is quite cost-effective (see the first sketch after this thread). – John Rotenstein Oct 30 '19 at 06:11
  • Having a Lambda launch an EC2 instance and the AWS Batch option are more or less the same. Do you have any idea when we should consider the ETL options like Data Pipeline / Glue? – user3035692 Oct 30 '19 at 06:51
  • We have our ETL jobs configured using an AWS Lambda event trigger on an S3 location: if a file is uploaded, the Lambda fires and starts the corresponding AWS Glue job. The Glue job reads the file from S3 and loads it into RDS directly, using RDS features. For the RDS operations, we attached the pymysql library to the Glue job for performing UPSERT operations against RDS (see the second sketch after this thread). – Yuva Oct 30 '19 at 07:02
  • Can you tell me how large the files are? – user3035692 Oct 30 '19 at 07:08
  • The max file that we have used is about 15GB in one load, and this happens every quarter in our ETL. Please check here for code snippets if needed: https://stackoverflow.com/questions/55621916/automate-bulk-loading-of-data-from-s3-to-aurora-mysql-rds-instance/55682730#55682730 – Yuva Oct 30 '19 at 08:21
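
Regarding the Lambda-launches-EC2 pattern mentioned in the comments, here is a minimal sketch, assuming an AMI with the AWS CLI and a processing script already baked in. The AMI ID, instance profile, and script path are all placeholders:

```python
import boto3

ec2 = boto3.client("ec2")

def launch_processor(bucket: str, key: str):
    """Launch a short-lived EC2 instance to process one S3 file, then
    let it terminate itself so you only pay for the seconds it ran."""
    user_data = f"""#!/bin/bash
aws s3 cp s3://{bucket}/{key} /tmp/input.dat
/opt/etl/process.sh /tmp/input.dat   # hypothetical processing script on the AMI
shutdown -h now                      # OS shutdown; becomes terminate (see below)
"""
    ec2.run_instances(
        ImageId="ami-0123456789abcdef0",                      # placeholder AMI
        InstanceType="t3.medium",
        MinCount=1,
        MaxCount=1,
        IamInstanceProfile={"Name": "etl-instance-profile"},  # assumed profile
        # 'terminate' makes the OS-level shutdown above terminate the
        # instance instead of just stopping it.
        InstanceInitiatedShutdownBehavior="terminate",
        UserData=user_data,
    )
```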
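
And for the Glue-plus-pymysql UPSERT approach Yuva describes, a minimal sketch of the UPSERT itself, assuming a MySQL-compatible RDS target; the connection details and table schema are placeholders (in a real Glue job they would typically come from a Glue connection or job parameters):

```python
import pymysql

# Placeholder connection details -- in Glue these would normally be read
# from a Glue connection or passed in as job parameters.
conn = pymysql.connect(host="my-rds-host", user="etl",
                       password="secret", database="etl_db")

# INSERT ... ON DUPLICATE KEY UPDATE is MySQL's native UPSERT: rows whose
# primary/unique key already exists are updated instead of inserted.
UPSERT_SQL = """
    INSERT INTO items (id, name, price)
    VALUES (%s, %s, %s)
    ON DUPLICATE KEY UPDATE name = VALUES(name), price = VALUES(price)
"""

def upsert_rows(rows):
    """UPSERT a batch of (id, name, price) tuples parsed from the S3 file."""
    with conn.cursor() as cur:
        cur.executemany(UPSERT_SQL, rows)
    conn.commit()

# Example: upsert_rows([(1, "widget", 9.99), (2, "gadget", 19.99)])
```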

0 Answers