
I have to ingest 15 TB of data from S3 into DynamoDB. There isn't any transformation required except adding a new column (insert date).

The data in S3 is in Parquet format with Snappy compression. It is partitioned on one key, but the DynamoDB table uses a different partition key and sort key.

How can we process everything in 10-15 days? We can use Amazon EMR with Hive or Spark. Please also recommend the cluster configuration (instance type, executor memory, cores, etc.). DynamoDB is in on-demand mode, so there is no provisioning limitation, but it is still a limiting factor because it can't absorb that many writes at once, so we need good retry logic (with exponential backoff). A rough sketch of the write path I have in mind is below.
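To make the retry requirement concrete, this is the kind of Spark write path I'm picturing; the bucket, table name, region, error codes, and backoff numbers are placeholders, not our real setup:

```python
# Illustrative only: bucket, table name, region, and backoff parameters are placeholders.
import time
from datetime import datetime, timezone
from decimal import Decimal

import boto3
from botocore.exceptions import ClientError
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("s3-to-dynamodb").getOrCreate()

# Read the Snappy-compressed Parquet data and stamp every record with an insert date.
df = (
    spark.read.parquet("s3://my-bucket/my-prefix/")
    .withColumn("insert_date", F.lit(datetime.now(timezone.utc).strftime("%Y-%m-%d")))
)

THROTTLE_CODES = {"ProvisionedThroughputExceededException", "ThrottlingException", "RequestLimitExceeded"}

def write_partition(rows):
    """Write one Spark partition to DynamoDB, backing off exponentially when throttled."""
    table = boto3.resource("dynamodb", region_name="us-east-1").Table("my-table")
    for row in rows:
        # DynamoDB rejects Python floats, so convert them to Decimal.
        item = {k: Decimal(str(v)) if isinstance(v, float) else v
                for k, v in row.asDict().items()}
        for attempt in range(8):
            try:
                table.put_item(Item=item)  # batch_writer() could be used here for higher throughput
                break
            except ClientError as err:
                if err.response["Error"]["Code"] not in THROTTLE_CODES:
                    raise
                time.sleep(min(0.1 * (2 ** attempt), 10))  # exponential backoff, capped at 10s

df.foreachPartition(write_partition)
```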


1 Answer


Use AWS Glue.

DynamoDB is most definitely not a limiting factor; there is no scale it cannot reach. I suggest that you use provisioned mode, for cost reasons: manually provision your table with as much write capacity as the Glue job can consume, and switch back to on-demand as soon as the job is complete.
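A minimal sketch of such a Glue job might look like the following; the bucket, prefix, table name, and write percentage are assumptions, not values from your question:

```python
# Minimal AWS Glue job sketch: read Parquet from S3, add insert_date, write to DynamoDB.
import sys
from datetime import datetime, timezone

from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the Snappy Parquet data straight from S3.
df = glue_context.spark_session.read.parquet("s3://my-bucket/my-prefix/")

# The only transformation: stamp each record with its insert date.
df = df.withColumn("insert_date", F.lit(datetime.now(timezone.utc).strftime("%Y-%m-%d")))

# Write to DynamoDB through the Glue DynamoDB connector.
glue_context.write_dynamic_frame_from_options(
    frame=DynamicFrame.fromDF(df, glue_context, "output"),
    connection_type="dynamodb",
    connection_options={
        "dynamodb.output.tableName": "my-table",
        "dynamodb.throughput.write.percent": "1.0",
    },
)

job.commit()
```

`dynamodb.throughput.write.percent` tells the connector what share of the table's provisioned write capacity to consume, so setting it high pairs naturally with provisioning the table as large as you can for the duration of the job.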

Leeroy Hannigan