I am working on a solution to send DynamoDB table data to an S3 bucket for multiple teams to consume. I need two separate pieces:
- Initial data dump of the table (~3 TB)
- Continuous replication of add/modify/delete operations (but not schema changes) to the same S3 bucket.
Based on the latest functionality available from AWS, what is the most affordable solution in terms of both cost and setup time? Is there a no-code option that can be created easily and be up and running fast?
I have considered the following:
- DynamoDB's native export to S3 (which requires PITR) for the initial load (see the sketch after this list)
- DynamoDB Streams and/or Kinesis Data Streams + Kinesis Data Firehose for the updates/adds/deletes (see the Lambda sketch at the end)
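For the initial load, this is roughly what I had in mind, using boto3's `export_table_to_point_in_time` (the table ARN, bucket name, and prefix are placeholders for my actual resources):

```python
import boto3

dynamodb = boto3.client("dynamodb")

# Kick off a full table export to S3. This requires PITR to be
# enabled on the table, but doesn't consume any read capacity.
response = dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/MyTable",  # placeholder
    S3Bucket="my-export-bucket",   # placeholder
    S3Prefix="initial-dump/",
    ExportFormat="DYNAMODB_JSON",  # ION is the other supported format
)

# The export runs asynchronously; poll ExportStatus until it completes.
print(response["ExportDescription"]["ExportStatus"])  # IN_PROGRESS until done
```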
Would EMR or Glue be helpful here, or is that overkill? What else should I be considering?
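For the change feed, here is a rough sketch of the Lambda I'd attach to the table's DynamoDB Stream, batching change records into S3. The bucket name and key scheme are just placeholders, and in practice I suspect Firehose would handle the buffering/partitioning for me instead:

```python
import json
from datetime import datetime, timezone

import boto3

s3 = boto3.client("s3")
BUCKET = "my-export-bucket"  # placeholder


def handler(event, context):
    # Each stream record carries the event name (INSERT/MODIFY/REMOVE)
    # plus the item images, so downstream teams can replay the changes.
    changes = []
    for record in event["Records"]:
        changes.append({
            "eventName": record["eventName"],
            "keys": record["dynamodb"].get("Keys"),
            "newImage": record["dynamodb"].get("NewImage"),
            "oldImage": record["dynamodb"].get("OldImage"),
        })

    # Write one object per invocation batch; a production setup would
    # likely push to Firehose for proper buffering instead of this.
    key = f"changes/{datetime.now(timezone.utc).isoformat()}.json"
    s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(changes))
```

Is something like this the right direction, or is there a managed/no-code path that avoids maintaining the Lambda entirely?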