Best practices for setting up a data pipeline on AWS? (Lambda/EMR/Redshift/Athena)

Question

*Disclaimer:* This is my first time posting on Stack Overflow, so excuse me if this is not the place for such a high-level question.

I just started working as a data scientist and I've been asked to set up an AWS environment for 'external' data. This data comes from different sources, in different formats (although it's mostly csv/xlsx). They want to store it on AWS and be able to query/visualize it with Tableau.

Despite my lack of AWS experience I managed to come up with a solution that's more or less working. This is my approach:

1. Raw csv/xlsx files are grabbed using a Lambda
2. Data is cleaned and transformed using pandas/numpy in the same Lambda as step 1
3. The processed data is written to S3 folders as CSV (still within the same Lambda)
4. Athena is used to index the data
5. Extra tables are created using Athena (some of which are views, others aren't)
6. The Athena connector is set up for Tableau

It works, but it feels like a messy solution: the queries are slow and the Lambdas are huge. Data is often not as normalized as it could be, since normalizing it increases query time even more. Storing as CSV also seems silly.

I've tried to read up on best practices, but it's a bit overwhelming. I've got plenty of questions, but it boils down to: What services should I be using in a situation like this? What does the high-level architecture look like?

score 1 · Answer 1 · answered Apr 16 '20 at 15:24

I have a fairly similar use case; however, it all comes down to the size of the project and how far you want to take the robustness / future planning of the solution.

As a first iteration, what you have described seems to work and is a reasonable approach, but as you pointed out it is quite basic and clunky. If the external data is something you will be ingesting consistently and you can foresee it growing, I would strongly suggest you design a data lake first. My recommendation would be to either use the AWS Lake Formation service or, if you want more control and prefer to build from the ground up, use something like the 3x3x3 approach.

With a correctly designed data lake, managing the data in the future becomes much simpler, and your files are nicely partitioned for future use / data diving.

A high-level architecture would be something like:

Lambda sends a request to the source and pastes the result to S3
Data lake system handles the file and auto-partitions + tags it
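As a rough sketch of the first step, the Lambda can do nothing more than fetch the file and drop it unchanged into a "raw" area of S3 under a Hive-style partitioned key, leaving cleaning to a later stage. The bucket name, the `raw/` prefix, the `source=/year=/month=/day=` layout and the `url` / `source` event fields are all assumptions made for illustration, not something prescribed by Lake Formation or any particular approach.

```python
import datetime as dt
import urllib.request

BUCKET = "my-datalake-raw"  # hypothetical bucket name


def build_raw_key(source: str, filename: str, when: dt.date) -> str:
    """Build a Hive-style partitioned S3 key for a raw file."""
    return (
        f"raw/source={source}/"
        f"year={when:%Y}/month={when:%m}/day={when:%d}/{filename}"
    )


def handler(event, context):
    """Lambda entry point: download one file and store it unchanged in S3."""
    import boto3  # imported lazily; shipped with the Lambda Python runtime

    url = event["url"]        # hypothetical event field
    source = event["source"]  # hypothetical event field
    filename = url.rsplit("/", 1)[-1]

    with urllib.request.urlopen(url) as resp:
        body = resp.read()

    key = build_raw_key(source, filename, dt.date.today())
    boto3.client("s3").put_object(Bucket=BUCKET, Key=key, Body=body)
    return {"bucket": BUCKET, "key": key}
```

Keeping the fetch step this small is one way to stop the Lambda from growing: pandas/numpy never need to be packaged with it.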

then,

Depending on how quickly you need to visualise your data, and if it is a large amount of data, consider using AWS Glue (Python shell or PySpark jobs) instead of Lambda, which will handle your pandas/numpy work a lot more cleanly.

I would also recommend converting your files to Parquet if you're using Athena or an equivalent, for improved query speed. Remember that file partitioning is important to performance!
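One possible way to do the Parquet conversion without leaving Athena is a CTAS (`CREATE TABLE AS SELECT`) query over the existing CSV-backed table. The sketch below only builds the query string and optionally submits it. The table, column, database and bucket names are hypothetical, and a Glue job writing Parquet would be an equally valid route.

```python
def build_ctas(source_table, target_table, location, columns, partition_cols):
    """Build an Athena CTAS statement that rewrites a table as partitioned Parquet.

    Athena expects partition columns to come last in the SELECT list,
    so they are appended after the regular columns.
    """
    select_list = ", ".join(list(columns) + list(partition_cols))
    partitions = ", ".join(f"'{col}'" for col in partition_cols)
    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (\n"
        f"  format = 'PARQUET',\n"
        f"  external_location = '{location}',\n"
        f"  partitioned_by = ARRAY[{partitions}]\n"
        f")\n"
        f"AS SELECT {select_list}\n"
        f"FROM {source_table}"
    )


def run_query(sql, database, output_location):
    """Submit the query to Athena and return its execution id."""
    import boto3  # imported lazily so the builder works without AWS access

    response = boto3.client("athena").start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


# Hypothetical names, for illustration only.
sql = build_ctas(
    source_table="sales_csv",
    target_table="sales_parquet",
    location="s3://my-datalake-curated/sales/",
    columns=["order_id", "amount"],
    partition_cols=["year", "month"],
)
```

Queries from Tableau that filter on `year` and `month` would then only scan the matching partitions, which is where most of the speed-up tends to come from.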

Note, the above is for quite a robust ingestion system and may be overkill if you have a basic use case with infrequent data ingestion.

If your data arrives in small packets but very frequently, you could even put a Kinesis layer in front of the Lambda-to-S3 step to pipe your data in a more organised manner. You could also use Redshift to host your data instead of S3 if you wanted a more contemporary warehouse solution. However, if you have x sources, I would suggest sticking with S3 for simplicity.
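If you did go down the Kinesis route, the producer side might look roughly like the sketch below: small records are serialised to JSON and sent in batches, since a single `PutRecords` call accepts at most 500 records. The stream name and the choice of partition key field are assumptions, and retry handling for failed records is left out.

```python
import json

MAX_RECORDS_PER_CALL = 500  # PutRecords accepts at most 500 records per request


def to_entries(rows, key_field):
    """Turn dict rows into the entry format expected by PutRecords."""
    return [
        {
            "Data": json.dumps(row).encode("utf-8"),
            "PartitionKey": str(row[key_field]),
        }
        for row in rows
    ]


def chunked(items, size=MAX_RECORDS_PER_CALL):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def send(rows, stream_name, key_field):
    """Send all rows to a Kinesis stream; return how many records failed."""
    import boto3  # imported lazily so the helpers work without AWS access

    client = boto3.client("kinesis")
    failed = 0
    for batch in chunked(to_entries(rows, key_field)):
        response = client.put_records(StreamName=stream_name, Records=batch)
        failed += response["FailedRecordCount"]
    return failed
```

A call such as `send(rows, "external-data-stream", "id")` (hypothetical stream name) would push the rows through in as few requests as the limit allows.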
