I have to ingest 15 TB of data from S3 into DynamoDB. No transformation is required except adding one new column (the insert date).
The data in S3 is in parquet format with snappy compression. The S3 data is partitioned on a different key than the DynamoDB table, which has its own partition key and sort key.
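For context, the transformation side is trivial. Here is a minimal PySpark sketch of what I mean, assuming the bucket path is a placeholder and `insert_date` is the name of the new column:

```python
# Minimal sketch of the read + transform step; the S3 path is a placeholder.
from pyspark.sql import SparkSession
from pyspark.sql.functions import current_date

spark = SparkSession.builder.appName("s3-to-dynamodb-backfill").getOrCreate()

# Snappy-compressed parquet is decompressed transparently by the parquet reader.
df = spark.read.parquet("s3://source-bucket/path/to/data/")

# The only transformation: stamp each row with the ingest date.
df = df.withColumn("insert_date", current_date())
```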
How can we process everything in 10-15 days? We can use Amazon EMR with Hive or Spark. Please also recommend a cluster configuration (instance type, number of executors, executor memory, cores, etc.).

The DynamoDB table is in on-demand mode, so there is no provisioned throughput to size, but DynamoDB is still the limiting factor: it can't absorb this many writes at once, so we need solid retry logic (with exponential backoff).
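To frame the retry question, this is roughly the writer logic I have in mind — a boto3 sketch (the table name, item format, and retry constants are illustrative assumptions, not a final implementation) that retries the `UnprocessedItems` returned by `batch_write_item` with exponential backoff and jitter:

```python
# Sketch of a DynamoDB batch write with exponential backoff + jitter.
# Table name, item shape, and retry limits are placeholder assumptions.
import random
import time

import boto3

def write_batch_with_backoff(items, table_name="my-table", max_retries=8):
    """Write up to 25 items (the batch_write_item limit), retrying
    UnprocessedItems with exponential backoff and jitter.

    Each item must already be in DynamoDB attribute-value format,
    e.g. {"pk": {"S": "..."}, "sk": {"S": "..."}, "insert_date": {"S": "..."}}.
    """
    client = boto3.client("dynamodb")
    request = {table_name: [{"PutRequest": {"Item": item}} for item in items]}
    for attempt in range(max_retries):
        response = client.batch_write_item(RequestItems=request)
        unprocessed = response.get("UnprocessedItems", {})
        if not unprocessed:
            return
        # Throttled items come back in UnprocessedItems; back off before retrying.
        time.sleep(min(20.0, (2 ** attempt) * 0.05) * (0.5 + random.random()))
        request = unprocessed
    raise RuntimeError("items still unprocessed after retries: "
                       f"{sum(len(v) for v in unprocessed.values())}")
```

In Spark this would run once per partition (e.g. via `df.foreachPartition`), with each task chunking its rows into batches of 25, so throttling in one task doesn't stall the others.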