
I am currently using AWS S3 as a data lake to store raw data; about 100 objects are added to the designated bucket every minute. I know the very basics of data pipelines and ETL, but I am still unfamiliar with the fundamentals, such as what Apache Spark is or how exactly AWS Glue works.

I am willing to work through tutorials and learn on my own, but I am lost on where to begin. Could you please guide me on where to start with the following tasks?

  1. Whenever new objects are added to the S3 bucket, transform them and store them in another data store.
  2. Where to store the resulting transformed items, if they are to be managed in a large CSV format (my guess is DynamoDB, since it is tabular data?).
  3. How would the low-level and the high-level solutions for these tasks compare? (for example, using Spark directly vs. using Glue)

Thank you!

Daniel

1 Answer


This depends on the use case.

For in-place (per-object) transformation you can / should go with AWS Lambda. For batch transformation you can use e.g. Glue or EMR; both can run Spark.
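As a minimal sketch of the Lambda route (the destination bucket name and the `transform` logic are placeholders, not a prescribed design), a function triggered by an S3 `ObjectCreated` notification could look roughly like this:

```python
import json
import urllib.parse

import boto3

s3 = boto3.client("s3")

# Hypothetical destination bucket; replace with your own data store.
DEST_BUCKET = "my-transformed-bucket"

def handler(event, context):
    # Invoked by an S3 "ObjectCreated" event notification on the raw bucket.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])

        # Read the raw object and apply the transformation.
        raw = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        transformed = transform(raw)

        # Write the result to the destination store (here: another bucket).
        s3.put_object(Bucket=DEST_BUCKET, Key=key, Body=transformed)

def transform(raw: bytes) -> bytes:
    # Placeholder transform: pretty-print JSON. Put your real logic here.
    return json.dumps(json.loads(raw), indent=2).encode("utf-8")
```

You wire this up by adding an event notification on the raw bucket that invokes the function, so each new object is transformed as it arrives.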

Where to store them / in what format depends on your access patterns; storing them in DynamoDB, for example, without understanding the access patterns very well is a very bad idea. Keeping them in S3, properly partitioned, with a metastore in Glue and accessing them via Athena might work. But that will be slow and will not work well with 100 files / minute: you need fewer, bigger files, i.e. "micro batches". In any case, every client can build a specific read model from the data and store and index it however they like for the actual application access.
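To illustrate the micro-batching idea, here is a rough PySpark sketch (runnable on Glue or EMR) that compacts many small raw JSON files into fewer, larger files per partition; the paths, the `ingested_at` field, and the choice of Parquet as output format are all assumptions for the example, not requirements:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("compact-raw-files").getOrCreate()

# Assumed locations; adjust to your buckets and layout.
RAW_PATH = "s3://my-raw-bucket/events/"
CURATED_PATH = "s3://my-curated-bucket/events/"

# Read the many small raw files in one pass.
df = spark.read.json(RAW_PATH)

# Derive a partition column, e.g. the ingestion date
# (assumes the records carry an "ingested_at" timestamp).
df = df.withColumn("dt", F.to_date(F.col("ingested_at")))

# Group rows by partition so each date ends up in a few larger files,
# then write a columnar format that Athena can query efficiently
# via a Glue catalog table.
(df.repartition("dt")
   .write
   .mode("append")
   .partitionBy("dt")
   .parquet(CURATED_PATH))
```

Run on a schedule (e.g. hourly), this turns the stream of tiny objects into partition-sized files that Athena handles much better.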

You have to ask yourself a couple of questions:

  • how well do you know the data?
  • do you know how much is coming in?
  • do you know how fast the data needs to be available?
  • do you know how the data will be accessed?
  • is it real-time data or batch data?
  • ...

I suggest you simply start working / experimenting with it; building a data lake and its architecture is a process that takes years.

luk2302