I have to ingest 15 TB of data from S3 into DynamoDB. No transformation is required except adding one new column (the insert date).
The data in S3 is in parquet format with snappy compression. The S3 data is partitioned on a different key than the DynamoDB table, which has its own partition key and sort key.
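For context, the transformation side is trivial. Here is a minimal PySpark sketch of what I mean, assuming the bucket path is a placeholder and `insert_date` is the name of the new column:

```python
# Minimal sketch of the read + transform step; the S3 path is a placeholder.
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_date

spark = SparkSession.builder.appName("s3-to-dynamodb-backfill").getOrCreate()

# Snappy-compressed parquet is decompressed transparently by the parquet reader.
df = spark.read.parquet("s3://source-bucket/path/to/data/")

# The only transformation: stamp each row with the ingest date.
df = df.withColumn("insert_date", current_date())
```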
How can we process everything in 10-15 days? We can use Amazon EMR with Hive or Spark. Please also recommend a cluster configuration (instance type, number of executors, executor memory, cores, etc.).

The DynamoDB table is in on-demand mode, so there is no provisioned throughput to size, but DynamoDB is still the limiting factor: it can't absorb this many writes at once, so we need solid retry logic (with exponential backoff).
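To frame the retry question, this is roughly the writer logic I have in mind — a boto3 sketch (the table name, item format, and retry constants are illustrative assumptions, not a final implementation) that retries the `UnprocessedItems` returned by `batch_write_item` with exponential backoff and jitter:

```python
# Sketch of a DynamoDB batch write with exponential backoff + jitter.
# Table name, item shape, and retry limits are placeholder assumptions.
import random
import time

import boto3

def write_batch_with_backoff(items, table_name="my-table", max_retries=8):
    """Write up to 25 items (the batch_write_item limit), retrying
    UnprocessedItems with exponential backoff and jitter.

    Each item must already be in DynamoDB attribute-value format,
    e.g. {"pk": {"S": "..."}, "sk": {"S": "..."}, "insert_date": {"S": "..."}}.
    """
    client = boto3.client("dynamodb")
    request = {table_name: [{"PutRequest": {"Item": item}} for item in items]}
    for attempt in range(max_retries):
        response = client.batch_write_item(RequestItems=request)
        unprocessed = response.get("UnprocessedItems", {})
        if not unprocessed:
            return
        # Throttled items come back in UnprocessedItems; back off before retrying.
        time.sleep(min(20.0, (2 ** attempt) * 0.05) * (0.5 + random.random()))
        request = unprocessed
    raise RuntimeError("items still unprocessed after retries: "
                       f"{sum(len(v) for v in unprocessed.values())}")
```

In Spark this would run once per partition (e.g. via `df.foreachPartition`), with each task chunking its rows into batches of 25, so throttling in one task doesn't stall the others.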