
I have a DynamoDB table on which I need to perform these actions on a weekly/monthly basis:

  1. Export the data to S3
  2. Delete the exported data from DynamoDB

Use case: We have only 10% of our traffic enabled so far and around 3k items, and the table is growing. We also need to give another account access to this data and prefer not to grant access to the table directly. To keep retrieval fast, to allow access from the other account, and because the data may not be used again in the near future, we are planning to export the data to S3.

Options:

  1. Data Pipeline is too complex, and we don't wish to use an EMR cluster.
  2. Not going with Glue, since there is no analysis to be performed.
  3. AWS's built-in DynamoDB export to S3

Planning for the S3 export (option 3) plus a Lambda function, triggered by an EventBridge rule, to schedule the export and then delete the DynamoDB records.
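
Roughly what I have in mind, as a minimal sketch only (the pk/sk key schema, the created_at attribute, and the environment variable names below are placeholders, not our real setup):

```python
# Sketch of the scheduled Lambda: one handler kicks off the native DynamoDB
# export to S3, a second handler deletes the items that were exported.
# TABLE_NAME, TABLE_ARN, EXPORT_BUCKET, pk/sk, and created_at are assumptions.
import os
import time
import boto3

dynamodb = boto3.client("dynamodb")
table = boto3.resource("dynamodb").Table(os.environ["TABLE_NAME"])

def start_export(event, context):
    # The native export requires point-in-time recovery (PITR) on the table.
    resp = dynamodb.export_table_to_point_in_time(
        TableArn=os.environ["TABLE_ARN"],
        S3Bucket=os.environ["EXPORT_BUCKET"],
        S3Prefix=time.strftime("exports/%Y-%m/"),
        ExportFormat="DYNAMODB_JSON",
    )
    return resp["ExportDescription"]["ExportArn"]

def delete_exported(event, context):
    # Delete items older than the cutoff passed in by the schedule.
    # This scans the whole table, so cost grows with table size --
    # fine at ~3k items, revisit as the table grows.
    cutoff = event["cutoff_iso"]  # e.g. "2024-09-01T00:00:00Z"
    scan_kwargs = {
        "FilterExpression": "created_at < :cutoff",
        "ExpressionAttributeValues": {":cutoff": cutoff},
        "ProjectionExpression": "pk, sk",  # assumed key schema
    }
    with table.batch_writer() as batch:
        while True:
            page = table.scan(**scan_kwargs)
            for item in page["Items"]:
                batch.delete_item(Key={"pk": item["pk"], "sk": item["sk"]})
            if "LastEvaluatedKey" not in page:
                break
            scan_kwargs["ExclusiveStartKey"] = page["LastEvaluatedKey"]
```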

Will this suffice, or is there a better approach? Please advise.

Reshma

2 Answers


A few options to consider:

Evergreen tables pattern

  1. Create a new table each month and have your application write to the new table based on the current time
  2. When the new month comes, the old month's table can be exported to S3
  3. Delete the old month's table once the export is done and you no longer need it

This one is probably the most cost effective because you can better control how long the items sit around. The biggest hassle is needing to provision new tables, update permissions, and have application logic to switch at the right time. Once it's up and running, it should be smooth, though. This is a pattern that's really common for folks using DDB for things like ML models, where they rotate tables regularly and don't want to pay for deleting all the old data. If you have strict SLAs on how long old data can stick around, this might be the best option.
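
As a rough sketch, the application-side switch is just deriving the table name from the current month (the base name and item shape here are made up):

```python
# Sketch of the "evergreen tables" routing: the app always writes to the
# table for the current month; last month's table gets exported and dropped.
# The base name "orders" and the item schema are placeholders.
import datetime
import boto3

BASE = "orders"

def current_table_name(now=None):
    now = now or datetime.datetime.now(datetime.timezone.utc)
    return f"{BASE}-{now:%Y-%m}"  # e.g. "orders-2024-09"

def write_item(item):
    table = boto3.resource("dynamodb").Table(current_table_name())
    table.put_item(Item=item)
```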

TTL pattern

  1. Set all your data to TTL at the end of the month
  2. Export your data before the TTL window
  3. Let TTL expire the items

This has the issue that TTL can take a fairly long time (days) to clean up a lot of items, since it uses background WCUs, which means you pay for the storage a bit longer. The plus side is that it's cost effective on WCUs. If you don't have a compliance need to get the data off DDB at a specific time, this works fine.
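
A minimal sketch of the setup, assuming the TTL attribute is called expires_at (the table name and item shape are made up):

```python
# Sketch of the TTL pattern: enable TTL once on the table, then stamp every
# item with an end-of-month expiry epoch. Names here are placeholders.
import calendar
import datetime
import boto3

TABLE = "orders"

def enable_ttl():
    # One-time setup: tell DynamoDB which attribute holds the expiry epoch.
    boto3.client("dynamodb").update_time_to_live(
        TableName=TABLE,
        TimeToLiveSpecification={"Enabled": True, "AttributeName": "expires_at"},
    )

def end_of_month_epoch(now=None):
    now = now or datetime.datetime.now(datetime.timezone.utc)
    last_day = calendar.monthrange(now.year, now.month)[1]
    end = now.replace(day=last_day, hour=23, minute=59, second=59, microsecond=0)
    return int(end.timestamp())

def write_item(item):
    item["expires_at"] = end_of_month_epoch()  # TTL expects epoch seconds
    boto3.resource("dynamodb").Table(TABLE).put_item(Item=item)
```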

Glue scan and delete pattern

I say use Glue, but really it's just that Spark-like things are pretty effective at doing stuff like this, even if it isn't analytics. You can also make it work with something like Step Functions, if you'd rather do that.

  1. Kick off the export
  2. Use the exported data in Glue to kick off deletes against DDB

This has the downside of being fairly expensive (gotta have extra WCUs to handle the deletes). It's fairly simple from your application's perspective, though. If you can't change application logic (to set TTL or to change which table is being written to), I'd go with this option.
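
If you'd rather go the Step Functions / plain Python route instead of Spark, the delete step can just walk the export files and replay the keys as deletes. A rough sketch, assuming a pk/sk string key schema and made-up bucket/table names:

```python
# Sketch of the "read the export, then delete" step: parse the DYNAMODB_JSON
# export files from S3 and batch-delete the keys. Could run as a Glue Python
# shell job, a Step Functions task, or a Lambda. All names are placeholders.
import gzip
import json
import boto3

s3 = boto3.client("s3")
table = boto3.resource("dynamodb").Table("orders")

def delete_exported_items(bucket, export_prefix):
    # export_prefix points at the export's AWSDynamoDB/<export-id> folder.
    paginator = s3.get_paginator("list_objects_v2")
    with table.batch_writer() as batch:
        for page in paginator.paginate(Bucket=bucket, Prefix=f"{export_prefix}/data/"):
            for obj in page.get("Contents", []):
                if not obj["Key"].endswith(".json.gz"):
                    continue
                body = s3.get_object(Bucket=bucket, Key=obj["Key"])["Body"].read()
                for line in gzip.decompress(body).splitlines():
                    item = json.loads(line)["Item"]  # DynamoDB-JSON encoded
                    batch.delete_item(Key={
                        "pk": item["pk"]["S"],
                        "sk": item["sk"]["S"],
                    })
```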

Chris Anderson

You can use https://www.npmjs.com/package/dynoport to export data from DynamoDB in a highly performant way and push it to S3 using an ECS cron job.