
We have a service in which a ~50 GB DynamoDB table serves as our feature repository for real-time, online applications.

We want to create a data lake from this table for historical data, model training, and analytics insights. We want to guarantee a 30-minute "freshness" of the data lake with respect to the original table.

However, I'm not sure what a good architecture for this would be. My understanding of data lakes is that you use a storage service (e.g., S3) to store the raw data with no processing. You then run ETL jobs (e.g., using Glue) that transform, process, and filter the data before it is consumed by whatever application needs it.
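For concreteness, here is a minimal sketch of what I imagine the ETL half looks like as a Glue PySpark job; the `s3://my-lake/raw/` and `s3://my-lake/processed/` paths are hypothetical placeholders:

```python
# Minimal Glue ETL sketch. Assumption (not from the original setup):
# raw DynamoDB dumps land under s3://my-lake/raw/ and processed output
# goes to s3://my-lake/processed/ as Parquet.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the raw dump (JSON) from the lake's landing area.
raw = glue_context.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={"paths": ["s3://my-lake/raw/"]},
    format="json",
)

# Transform/filter here as needed, then write analytics-friendly Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=raw,
    connection_type="s3",
    connection_options={"path": "s3://my-lake/processed/"},
    format="parquet",
)
job.commit()
```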

But here is my doubt: does this mean that we have to dump the DynamoDB table into S3 every 30 minutes? This can be done easily enough, but it sounds wasteful (it would come to ~876 TB/year).
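For reference, one way to do such a dump would be DynamoDB's native export to S3, which requires point-in-time recovery to be enabled on the table. This is only a sketch; the table ARN and bucket name are placeholders:

```python
# Sketch of a scheduled full export using DynamoDB's native "export to S3"
# feature (requires point-in-time recovery enabled on the table).
import boto3

dynamodb = boto3.client("dynamodb")

response = dynamodb.export_table_to_point_in_time(
    TableArn="arn:aws:dynamodb:us-east-1:123456789012:table/feature-store",
    S3Bucket="my-feature-lake",
    S3Prefix="exports/",
    ExportFormat="DYNAMODB_JSON",
)
# The export runs asynchronously and does not consume table read capacity.
print(response["ExportDescription"]["ExportStatus"])  # e.g. IN_PROGRESS
```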

Am I missing something in the data lake pipeline?

justHelloWorld

1 Answer


You've hit a common problem, and it's one AWS are actively working on.

If you want continuous syncing from DynamoDB to S3, it's possible using existing technology, including DynamoDB Streams. I suggest checking out this project in awslabs. Frankly, it's quite a bit of effort.
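To give a sense of what that route involves, here is a minimal sketch (not the awslabs project itself): a Lambda function subscribed to the table's stream that writes each change to S3. The bucket name is a placeholder, and a real pipeline would batch records (e.g., through Kinesis Data Firehose) rather than write one object per change.

```python
# Minimal sketch of the Streams route: a Lambda subscribed to the table's
# DynamoDB Stream appends each change to S3. BUCKET is a placeholder.
import json

import boto3

s3 = boto3.client("s3")
BUCKET = "my-feature-lake"  # hypothetical

def lambda_handler(event, context):
    for record in event["Records"]:
        # With a NEW_AND_OLD_IMAGES stream view, NewImage holds the item
        # after the change (absent for REMOVE events).
        image = record["dynamodb"].get("NewImage")
        if image is None:
            continue
        key = f"changes/{record['eventID']}.json"
        s3.put_object(Bucket=BUCKET, Key=key, Body=json.dumps(image))
    return {"processed": len(event["Records"])}
```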

However, I believe AWS are about to release a product that will keep DynamoDB tables and S3 buckets in sync, without code, in a few clicks. It's called AWS Glue Elastic Views. The product is in preview; they announced it in December 2020, so I'm hoping it will be available soon. There is also a form you can fill in to join the preview, but there is no guarantee AWS will grant you access.

F_SO_K
  • From a quick look at the introduction video, it seems that Glue Elastic Views is exactly what I'm looking for! About the GitHub project: I think that [this](https://aws.amazon.com/blogs/aws/new-export-amazon-dynamodb-table-data-to-data-lake-amazon-s3/) solution is easier (and maybe more reliable, since it's offered by AWS). – justHelloWorld Aug 16 '21 at 18:13
  • After further thought, I think Glue Elastic Views may not fit our use case: the service seems to keep a single copy up to date, while what we need are snapshots of the table over time (so we can build historical data for ML). My understanding is that it would keep only the most recent version. – justHelloWorld Aug 17 '21 at 07:59
  • Repeatedly saving the same data is not scalable. It sounds like what you should be doing is dating and versioning the items in your source data. For example, give each item a created-date and an archived-date. If you want to see the data store as it was on a particular date, find items whose created-date is on or before that date and whose archived-date is after it or does not exist (a sketch of such a query follows these comments). – F_SO_K Aug 17 '21 at 12:48
  • AWS have published some design patterns around versioning your data: https://aws.amazon.com/blogs/database/implementing-version-control-using-amazon-dynamodb/ – F_SO_K Aug 17 '21 at 12:50
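To illustrate the versioning pattern from the comments above, here is a sketch of the as-of query, under the assumption that each item carries a `created_date` and an optional `archived_date` (ISO-8601 strings). The table and attribute names are hypothetical, and a real job would paginate the scan with `LastEvaluatedKey`:

```python
# Sketch of the "as-of" query: reconstruct the table as it looked at a
# given instant, using hypothetical created_date/archived_date attributes.
import boto3
from boto3.dynamodb.conditions import Attr

table = boto3.resource("dynamodb").Table("feature-store")
as_of = "2021-08-01T00:00:00Z"

# An item existed at `as_of` if it was created on or before that instant
# and was either never archived or archived afterwards.
response = table.scan(
    FilterExpression=Attr("created_date").lte(as_of)
    & (Attr("archived_date").not_exists() | Attr("archived_date").gt(as_of))
)
items = response["Items"]
```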