
I have a data lake in AWS S3. The data is stored as Parquet, and the daily volume is ~70 GB. I want to build some ad-hoc analytics on top of that data. To do that I see two options:

  1. Use AWS Athena to query the data in place via HiveQL, with the table metadata kept in the AWS Glue Data Catalog (roughly as sketched below).
  2. Move data from S3 into Redshift as a data warehouse and query Redshift to perform ad-hoc analysis.
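
For option 1, the table definition would look roughly like the sketch below. This is only a sketch; the bucket, database, table, and column names are placeholders:

    -- Minimal sketch: external table over the daily Parquet files in S3
    -- (bucket, database, table, and column names are placeholders)
    CREATE EXTERNAL TABLE IF NOT EXISTS datalake.events (
      event_id   string,
      user_id    string,
      amount     double,
      created_at timestamp
    )
    PARTITIONED BY (dt string)
    STORED AS PARQUET
    LOCATION 's3://my-data-lake/events/';

    -- Register a day's partition, then query only that day
    ALTER TABLE datalake.events ADD IF NOT EXISTS
      PARTITION (dt = '2021-02-24') LOCATION 's3://my-data-lake/events/dt=2021-02-24/';

    SELECT user_id, SUM(amount) AS total
    FROM datalake.events
    WHERE dt = '2021-02-24'
    GROUP BY user_id;

For option 2, the load step would be roughly the following; again only a sketch, with the target table, S3 path, and IAM role ARN as placeholders:

    -- Minimal sketch: load one day of Parquet files from S3 into a Redshift table
    -- (target table, S3 path, and IAM role ARN are placeholders)
    CREATE TABLE IF NOT EXISTS events (
      event_id   varchar(64),
      user_id    varchar(64),
      amount     double precision,
      created_at timestamp
    );

    COPY events
    FROM 's3://my-data-lake/events/dt=2021-02-24/'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-redshift-load-role'
    FORMAT AS PARQUET;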

What is the best way to do ad-hoc analysis in my case? Is there a more efficient way? And what are the pros and cons of the options mentioned?

PS

After 6 months I'm going to move data from S3 to Amazon Glacier, so the maximum data volume to query in S3/Redshift would be ~13 TB.

Finkelson
  • You say "Move data from S3 into Redshift". Have you thought about Redshift Spectrum and querying the data in place (in S3)? You will need to have appropriate data partitioning in place, but Spectrum can be a fantastic solution for lots of query patterns (but not all). – Bill Weiner Feb 24 '21 at 17:04
  • It's interesting option. I didn't consider it yet. Thank you. – Finkelson Feb 24 '21 at 20:01
  • Since you are interested in Spectrum I'll give a quick summary of pros/cons. Spectrum passes some work off to S3, where it can be done locally on the data. WHERE clauses, grouping, and simple aggregation can all be done in S3, which can create really fast solutions. However, joins, window functions, and everything after those pushed-down steps have to be done in Redshift. So if you need to join tables before applying a WHERE clause and doing grouping, all the data will have to be moved to Redshift first. Not so performant. With the right data model and query patterns Spectrum is great. – Bill Weiner Feb 24 '21 at 20:57
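
A rough sketch of the Spectrum setup described in the comments, with the schema, database, table, and IAM role names as placeholders; the local users table is a hypothetical dimension table kept inside Redshift:

    -- Minimal sketch: expose the Glue catalog database to Redshift as an external schema
    -- (schema, database, table, and IAM role names are placeholders)
    CREATE EXTERNAL SCHEMA IF NOT EXISTS spectrum
    FROM DATA CATALOG
    DATABASE 'datalake'
    IAM_ROLE 'arn:aws:iam::123456789012:role/my-spectrum-role';

    -- Filtering and simple aggregation on the external table can be pushed down to S3
    SELECT user_id, COUNT(*) AS events
    FROM spectrum.events
    WHERE dt = '2021-02-24'
    GROUP BY user_id;

    -- A join against a local Redshift table (here a hypothetical users table) runs in
    -- the cluster, so the matching external rows are pulled into Redshift first
    SELECT u.country, SUM(e.amount) AS total
    FROM spectrum.events e
    JOIN users u ON u.user_id = e.user_id
    WHERE e.dt = '2021-02-24'
    GROUP BY u.country;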

0 Answers