
I am very new to data architecture, and I want to build an end-to-end architecture:

  • Source: Snowflake Tables
  • Target: Snowflake Tables

In between we have to do some processing, here is the flow:

  1. We export data from Snowflake tables (specific columns, using joins) to AWS S3 (an unload sketch follows this list).
  2. These files are then consumed by AWS SageMaker (the Python code that processes them is already written), but we still need to build the pipeline for this step.
  3. Once processing is done, the processed data is written back to AWS S3 (a different bucket).
  4. Finally, we need to load these files back into Snowflake tables.
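For step 1, a minimal sketch of the unload could look like the following, using the snowflake-connector-python package and Snowflake's COPY INTO <location> command. The connection parameters, table names, S3 bucket, and storage integration (my_s3_int) are placeholders, and it assumes a storage integration to the bucket already exists:

```
# Sketch: unload a joined result set from Snowflake to S3 with COPY INTO.
# All connection parameters, tables, bucket paths and the storage
# integration name are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",      # placeholder
    user="my_user",            # placeholder
    password="...",            # use a secrets manager in practice
    warehouse="MY_WH",
    database="MY_DB",
    schema="MY_SCHEMA",
)

unload_sql = """
COPY INTO 's3://my-input-bucket/exports/'
FROM (
    SELECT a.id, a.col1, b.col2
    FROM table_a a
    JOIN table_b b ON a.id = b.a_id
)
STORAGE_INTEGRATION = my_s3_int            -- pre-created storage integration
FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP)
HEADER = TRUE
OVERWRITE = TRUE
"""

cur = conn.cursor()
cur.execute(unload_sql)
cur.close()
conn.close()
```

The same COPY INTO statement could instead live inside a Snowflake task or be issued by an Airflow operator; the connector call above is just the simplest way to show the unload itself.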

Requirement:

I need to link all these tools and create a workflow.

First of all, how do I create a SageMaker pipeline around code that is already written?
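One common way to wrap an existing processing script is the SageMaker Pipelines SDK with a ProcessingStep. The sketch below assumes the existing code lives in a script called process.py; the role ARN, bucket names, and instance type are placeholders:

```
# Sketch: a SageMaker pipeline with a single processing step that runs an
# existing Python script against files in S3. Role ARN, buckets and the
# script name (process.py) are assumptions.
from sagemaker.sklearn.processing import SKLearnProcessor
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.workflow.steps import ProcessingStep
from sagemaker.workflow.pipeline import Pipeline

role = "arn:aws:iam::123456789012:role/SageMakerExecutionRole"  # placeholder

processor = SKLearnProcessor(
    framework_version="1.2-1",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
)

step = ProcessingStep(
    name="ProcessSnowflakeExtract",
    processor=processor,
    code="process.py",                                   # the existing script
    inputs=[ProcessingInput(
        source="s3://my-input-bucket/exports/",           # placeholder bucket
        destination="/opt/ml/processing/input")],
    outputs=[ProcessingOutput(
        source="/opt/ml/processing/output",
        destination="s3://my-output-bucket/processed/")], # placeholder bucket
)

pipeline = Pipeline(name="snowflake-processing-pipeline", steps=[step])
pipeline.upsert(role_arn=role)   # register / update the pipeline definition
# pipeline.start()               # start a run manually for testing
```

Once pipeline.upsert() has registered the definition, the pipeline can be started from the console, from pipeline.start(), or from a Lambda function (as sketched further below).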

My approach:

  1. We can create a Snowflake task (or run COPY INTO directly) to export data from the Snowflake tables to AWS S3.
  2. Assuming the SageMaker pipeline exists, create an AWS Lambda function to trigger it once a file lands in AWS S3 (see the Lambda sketch after this list).
  3. Once the processed data is available in AWS S3, trigger an Airflow DAG to load it into the Snowflake tables.
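For step 2 of this approach, a minimal Lambda sketch could look like the following. It assumes the input bucket's ObjectCreated event notification is pointed at this Lambda, and that the pipeline name matches whatever was registered with upsert():

```
# Sketch: Lambda handler wired to the S3 ObjectCreated notification on the
# export bucket; it starts the SageMaker pipeline. Pipeline name is a
# placeholder.
import boto3

sm = boto3.client("sagemaker")

def lambda_handler(event, context):
    # The S3 event tells us which file landed; useful for logging, or for
    # passing as a pipeline parameter if the pipeline is parameterised.
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]
    print(f"New file s3://{bucket}/{key} - starting pipeline")

    response = sm.start_pipeline_execution(
        PipelineName="snowflake-processing-pipeline",        # placeholder
        PipelineExecutionDisplayName="triggered-by-s3-event",
    )
    return {"pipelineExecutionArn": response["PipelineExecutionArn"]}
```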

What I am not able to figure out is how to link all of these tools together in one workflow.
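One way to express the last leg (and make the linkage explicit) is an Airflow DAG that waits for the processed files and then runs COPY INTO against Snowflake. The connection IDs, bucket, external stage, and table names below are placeholders:

```
# Sketch: Airflow DAG that waits for processed files in the output bucket and
# loads them into Snowflake. Connection IDs, bucket, stage and table names
# are assumptions.
from datetime import datetime

from airflow import DAG
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.providers.snowflake.operators.snowflake import SnowflakeOperator

with DAG(
    dag_id="load_processed_data_to_snowflake",
    start_date=datetime(2023, 7, 1),
    schedule_interval=None,   # trigger externally, or set a cron schedule
    catchup=False,
) as dag:

    wait_for_files = S3KeySensor(
        task_id="wait_for_processed_files",
        bucket_name="my-output-bucket",          # placeholder
        bucket_key="processed/*.csv.gz",
        wildcard_match=True,
        aws_conn_id="aws_default",
    )

    load_into_snowflake = SnowflakeOperator(
        task_id="copy_into_target_table",
        snowflake_conn_id="snowflake_default",
        sql="""
            COPY INTO MY_DB.MY_SCHEMA.TARGET_TABLE
            FROM @MY_DB.MY_SCHEMA.PROCESSED_STAGE  -- external stage on the output bucket
            FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP SKIP_HEADER = 1)
            PURGE = TRUE
        """,
    )

    wait_for_files >> load_into_snowflake
```

The same DAG could also own the first two legs (the Snowflake unload and the SageMaker pipeline trigger) via additional SnowflakeOperator tasks and a boto3 call (or the Amazon provider's SageMaker start-pipeline operator, if your provider version includes it), which would put the end-to-end linkage in one place instead of splitting it between Lambda and Airflow.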

    Have you considered running your python directly in Snowflake, rather than extracting data out and back in? You could still use Airflow to coordinate, or use Snowflake's Tasks to do the same. – Mike Walton Jul 25 '23 at 13:21
  • Or maybe create a bunch of SQL files and run them through DBT, and voilà, you will get the data. DBT works very well with Snowflake and provides a lot of functionality for ETL/ELT-type work. This would mean a redesign, but it is worth considering if you are looking to improve your system, reduce the number of tools, and stop moving data around. The more tools you use, the more cost and dependency increase. – Koushik Roy Jul 25 '23 at 14:06
