I am currently using an AWS S3 bucket as a data lake for raw data; it receives roughly 100 new objects per minute. I know the very basics of data pipelines and ETL, but I am still unfamiliar with the fundamentals, such as what Apache Spark is or how AWS Glue actually works.
I am willing to work through tutorials and learn on my own, but I am lost on where to begin. Could you please point me in the right direction for the following tasks?
- Whenever new objects are added to the S3 bucket, transform them and store the result in another data store.
- Where should I store the transformed output if it needs to be managed as large CSV-style tabular data? (My guess is DynamoDB, since it is table data?)
- What would a low-level solution and a high-level solution for these tasks look like (for example, running Spark myself vs. using a managed service like AWS Glue)?
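
For the first task, the kind of thing I have in mind is roughly this, a minimal sketch assuming a Lambda function subscribed to the bucket's `ObjectCreated` event notifications. The `transform_csv` logic and the `transformed-bucket` destination name are placeholders I made up; the real transform would depend on my data:

```python
import csv
import io

def transform_csv(raw_text: str) -> str:
    """Placeholder transform: lowercase the column names and drop blank rows.
    The real transform logic would depend on the actual raw data."""
    reader = csv.DictReader(io.StringIO(raw_text))
    fieldnames = [name.strip().lower() for name in reader.fieldnames]
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames)
    writer.writeheader()
    for row in reader:
        # Skip rows whose cells are all empty
        if not any((v or "").strip() for v in row.values()):
            continue
        writer.writerow({k.strip().lower(): (v or "") for k, v in row.items()})
    return out.getvalue()

def lambda_handler(event, context):
    """Entry point when wired to an S3 ObjectCreated:* event notification."""
    import boto3  # included in the AWS Lambda Python runtime
    s3 = boto3.client("s3")
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
        transformed = transform_csv(body)
        # "transformed-bucket" is a placeholder for wherever the output should live
        s3.put_object(Bucket="transformed-bucket", Key=key,
                      Body=transformed.encode("utf-8"))
```

Is an event-driven setup like this a reasonable starting point at my scale, or would Glue/Spark handle this step for me?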
Thank you!