What is the best way to re-create a relational database from a change log (data lake) in AWS S3?

Question

I have stored change logs (data with information about data) from non-relational, schemaless data tables in S3. Now I want a structured relational database so that I can query all of the data, which means I need to build a database from what is in S3. I am not sure what to do: should I use another S3 bucket, or a traditional database?

Shubham Jain · Answer 1 · 2020-05-04T16:43:32.367

1

You can create a Glue catalog over the data and query it with serverless Athena. This way you are not tied to an RDBMS and can query your data whenever you need to, while keeping the files in S3.

This is also cost-effective. You can still spin up an RDS instance in AWS at any time if required, so keeping the files in S3 is a good option.
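
As a rough illustration of querying the catalogued data, here is a minimal sketch that starts an Athena query and waits for it to finish. It assumes a boto3 Athena client is passed in; the function name `run_athena_query` and the database, table, and bucket names in the usage comment are hypothetical.

```python
import time


def run_athena_query(athena, sql, database, output_location, poll_seconds=1.0):
    """Start an Athena query, poll until it finishes, and return its execution id."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state == "SUCCEEDED":
            return query_id
        if state in ("FAILED", "CANCELLED"):
            raise RuntimeError(f"Athena query {query_id} ended in state {state}")
        time.sleep(poll_seconds)


# Usage (assumed names, requires boto3 and AWS credentials):
#   import boto3
#   athena = boto3.client("athena")
#   run_athena_query(athena, "SELECT * FROM changelog LIMIT 10",
#                    "my_glue_database", "s3://my-athena-results/")
```

The polling loop is also where the per-query latency mentioned later in the thread comes from: every query is submitted, queued, and then checked for completion.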

edited May 04 '20 at 16:43

answered May 04 '20 at 16:38


The thing is, I need to convert the data from schema-less to relational. If I use Glue with Athena, I need a cron job with a Lambda to do that conversion, plus another S3 bucket for the output. But is storing it in another S3 bucket a good option when all I need is a platform where I can run SQL queries and get results really fast? – isambitd May 04 '20 at 17:21
You can just add an S3 event notification and process each file with Lambda as soon as it arrives in S3. It is well worth it, because you pay only for what you query and you can run any analytics on your data at a later stage. S3 storage is very cheap, and after processing through Lambda you can move your raw files to Glacier. – Shubham Jain May 04 '20 at 17:49
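
A minimal sketch of such a Lambda handler, assuming the standard S3 event notification layout (`Records[].s3.bucket.name` and `Records[].s3.object.key`). The function `process_changelog_object` is a hypothetical placeholder for whatever conversion you run on each file.

```python
from urllib.parse import unquote_plus


def objects_from_s3_event(event):
    """Return (bucket, key) pairs for every record in an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        s3_info = record["s3"]
        bucket = s3_info["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(s3_info["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def process_changelog_object(bucket, key):
    """Hypothetical placeholder: read the raw file, flatten it, write structured output."""
    print(f"would process s3://{bucket}/{key}")


def handler(event, context):
    """Lambda entry point, invoked for each S3 event notification."""
    pairs = objects_from_s3_event(event)
    for bucket, key in pairs:
        process_changelog_object(bucket, key)
    return {"processed": len(pairs)}
```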
Thanks for your reply. This is the setup I am currently using, with Athena querying S3. Athena works fine when I query larger data sets, and it also works well when we do analytics with a third-party library. But it is not a good option when we have to run smaller queries many times; the time Athena takes then really has an impact. – isambitd May 05 '20 at 14:49
1

In that scenario you can leverage S3 Select. You can write a custom script that runs your query through S3 Select, which is very fast, believe me. – Shubham Jain May 05 '20 at 15:16
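
A minimal sketch of such a script, assuming the object is stored as JSON Lines and that a boto3 S3 client is passed in. The function name `select_rows` and the bucket, key, and field names in the usage comment are hypothetical, and it is worth checking first that S3 Select is available for your account.

```python
import json


def select_rows(s3, bucket, key, expression):
    """Run an S3 Select SQL expression against one JSON Lines object; return parsed rows."""
    response = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=expression,
        InputSerialization={"JSON": {"Type": "LINES"}},
        OutputSerialization={"JSON": {}},
    )
    chunks = []
    # The payload is an event stream; only "Records" events carry data.
    for event in response["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"])
    # Join before decoding, since a record can be split across chunks.
    text = b"".join(chunks).decode("utf-8")
    return [json.loads(line) for line in text.splitlines() if line.strip()]


# Usage (assumed names, requires boto3 and AWS credentials):
#   import boto3
#   s3 = boto3.client("s3")
#   select_rows(s3, "my-structured-bucket", "items/part-0001.json",
#               "SELECT * FROM S3Object s WHERE s.id = '42'")
```

S3 Select works on one object per call, so this suits small lookups where you already know which file to read.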
Thanks a lot, Shubham. I have never used S3 Select in production, but I will definitely try it. I was also considering AWS RDS (PostgreSQL). What would be the pros and cons of using that compared to Athena (bulk queries) + S3 Select (individual queries)? – isambitd May 07 '20 at 08:52
If your data is transactional and needs frequent access, if you want to capture changed data in the future, and if you have no problem managing a database along with its maintenance and downtime issues, then go for RDS, as it will fulfill your querying needs. Athena is preferable when you want a cost-efficient data lake and want to keep all your data in S3 itself. – Shubham Jain May 07 '20 at 11:34
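
If you go the RDS route, the change log has to be replayed into tables. Here is a minimal sketch of that replay, using SQLite as a local stand-in for an RDS database. The change-log format is an assumption: one JSON object per line with hypothetical fields `op`, `id`, and `data`, where each upsert carries the full row. The `items` table and its columns are also made up for illustration.

```python
import json
import sqlite3


def replay_changelog(conn, lines):
    """Apply change-log entries in order so the table reflects the latest state."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items (id TEXT PRIMARY KEY, name TEXT, price REAL)"
    )
    for line in lines:
        entry = json.loads(line)
        if entry["op"] == "delete":
            conn.execute("DELETE FROM items WHERE id = ?", (entry["id"],))
        else:
            # Assumed: any other op is an upsert that carries the full row.
            data = entry.get("data", {})
            conn.execute(
                "INSERT OR REPLACE INTO items (id, name, price) VALUES (?, ?, ?)",
                (entry["id"], data.get("name"), data.get("price")),
            )
    conn.commit()


changelog = [
    '{"op": "upsert", "id": "1", "data": {"name": "pen", "price": 1.0}}',
    '{"op": "upsert", "id": "2", "data": {"name": "ink", "price": 3.0}}',
    '{"op": "upsert", "id": "1", "data": {"name": "pen", "price": 1.5}}',
    '{"op": "delete", "id": "2"}',
]
conn = sqlite3.connect(":memory:")
replay_changelog(conn, changelog)
print(conn.execute("SELECT id, name, price FROM items ORDER BY id").fetchall())
```

Entries must be applied in the order they were written, otherwise an older update can overwrite a newer one.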
