
I'm new to building data pipelines where dumping files in the cloud is one or more steps in the data flow. Our goal is to store large, raw sets of data from various APIs in the cloud, then pull only what we need (summaries of this raw data) and store that in our on-premises SQL Server for reporting and analytics. We want to do this in the simplest, most logical, and most robust way. We have chosen AWS as our cloud provider, but since we're in the early phases we aren't attached to any particular architecture or services. Because I'm no expert with the cloud or AWS, I thought I'd post my plan for accomplishing our goal and see if anyone has advice for us. Does this architecture for our data pipeline make sense? Are there any alternative services or data flows we should look into? Thanks in advance.

1) Gather data from multiple sources (using APIs)

2) Dump responses from APIs into S3 buckets

3) Use Glue Crawlers to create a Data Catalog of data in S3 buckets

4) Use Athena to query summaries of the data in S3

5) Store data summaries obtained from Athena queries in our on-premises SQL Server (rough sketches of steps 2, 4, and 5 follow below)
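
To make steps 2 and 4 concrete, here is a minimal boto3 sketch of what I have in mind. The bucket, key layout, Glue database, table, and query are all hypothetical placeholders, and the source API is assumed to be a plain HTTP endpoint:

```python
# Minimal sketch of steps 2 and 4 -- all names below are hypothetical placeholders.
import json
import time

import boto3
import requests  # assuming the source APIs are plain HTTP endpoints

s3 = boto3.client("s3")
athena = boto3.client("athena")

# Step 2: dump a raw API response into S3 as-is.
response = requests.get("https://api.example.com/v1/orders")  # hypothetical source API
s3.put_object(
    Bucket="my-raw-data-bucket",             # hypothetical bucket
    Key="orders/2024/01/15/orders.json",     # partition-friendly key layout
    Body=json.dumps(response.json()),
)

# Step 4: run a summary query against the table the Glue crawler registered.
query = athena.start_query_execution(
    QueryString="SELECT order_date, SUM(amount) AS total FROM orders GROUP BY order_date",
    QueryExecutionContext={"Database": "raw_data_catalog"},             # hypothetical Glue database
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},  # hypothetical results bucket
)
query_id = query["QueryExecutionId"]

# Athena is asynchronous, so poll until the query finishes.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(2)

rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```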

Note: We will program the entire data pipeline in Python, which seems like a good call no matter which AWS services we use, since boto3 is pretty capable from what I've seen so far.
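
For step 5, the rough idea is to push the Athena results into SQL Server with pyodbc. This sketch assumes the `rows` variable from the Athena example above; the server, database, credentials, and target table are hypothetical:

```python
# Minimal sketch of step 5: load the Athena summary rows into on-prem SQL Server.
# Assumes pyodbc and the Microsoft ODBC driver are installed where the pipeline runs.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=onprem-sql.example.local;"   # hypothetical server
    "DATABASE=Reporting;"                # hypothetical database
    "UID=pipeline_user;PWD=...;"         # credentials elided
)
cursor = conn.cursor()

# Athena returns the header as the first row; the rest are data rows.
header, *data_rows = rows  # `rows` comes from get_query_results above
for row in data_rows:
    order_date, total = (col.get("VarCharValue") for col in row["Data"])
    cursor.execute(
        "INSERT INTO dbo.OrderSummaries (OrderDate, Total) VALUES (?, ?)",  # hypothetical table
        order_date,
        total,
    )
conn.commit()
conn.close()
```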

eTothEipiPlus1

1 Answer


You may use Glue jobs (PySpark) for #4 and #5, and you can automate the flow using Glue triggers (see the sketch below).
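
As a rough illustration of that suggestion (not a definitive implementation), a Glue PySpark job could read the crawled table from the Data Catalog, compute the summary in Spark instead of Athena, and write it straight to SQL Server over JDBC. The database, table, connection URL, and column names below are hypothetical:

```python
# Sketch of a Glue (PySpark) job script covering steps 4 and 5 in one place.
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Step 4: read the raw data through the table the crawler registered in the Data Catalog.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="raw_data_catalog",   # hypothetical Glue database
    table_name="orders",           # hypothetical crawled table
).toDF()

# Summarize in Spark rather than Athena.
summary = orders.groupBy("order_date").agg(F.sum("amount").alias("total"))

# Step 5: write the summary to on-prem SQL Server over JDBC (requires network
# access from the Glue job to the SQL Server, e.g. via a Glue connection / VPN).
(summary.write
    .format("jdbc")
    .option("url", "jdbc:sqlserver://onprem-sql.example.local:1433;databaseName=Reporting")  # hypothetical
    .option("dbtable", "dbo.OrderSummaries")  # hypothetical target table
    .option("user", "pipeline_user")
    .option("password", "...")
    .mode("append")
    .save())

job.commit()
```

A Glue trigger (scheduled, on-demand, or conditional on the crawler completing) can then kick this job off, which covers the automation part.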

Sandeep Fatangare