
For example:

Source : S3

Scheduled update: 2 times/day

Target : Snowflake Table

Streaming : NO

On Day 0, I have a customer table created and loaded in a Snowflake DB. The S3 files are updated 2 times a day, and those changes need to be reflected in the Snowflake table as well.

I want to come up with a solution so that the following 3 cases are handled 2 times/day, without streaming or messaging technology:

  1. Insert - new records
  2. Update - existing records
  3. Delete - existing records
osr

1 Answer


If you want them to stay in sync at all times, you could use EXTERNAL TABLES in Snowflake, which are always kept in sync with the underlying S3 files. Lay a MATERIALIZED VIEW over your external table, and you essentially have a Snowflake table that is in sync with your S3 files: every time an S3 file is modified or added, the change is reflected in the MV.

https://docs.snowflake.com/en/user-guide/tables-external.html
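A sketch of that setup, assuming a stage named `s3_stage` and illustrative column names (adjust to your actual file layout; note that auto-refresh of external table metadata relies on S3 event notifications being configured):

```sql
-- External table over the S3 stage; columns are projected out of the
-- semi-structured VALUE record that Snowflake exposes per file row
CREATE OR REPLACE EXTERNAL TABLE customer_ext (
    customer_id NUMBER  AS (VALUE:c1::NUMBER),
    name        VARCHAR AS (VALUE:c2::VARCHAR),
    email       VARCHAR AS (VALUE:c3::VARCHAR)
)
LOCATION = @s3_stage
AUTO_REFRESH = TRUE
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- Materialized view gives you a regular, query-fast object
-- that Snowflake keeps in sync with the external table
CREATE MATERIALIZED VIEW customer_mv AS
SELECT customer_id, name, email
FROM customer_ext;
```

Queries then go against `customer_mv` as if it were an ordinary table.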

If the files that are coming in have updated records and deletes marked in the data (i.e., CDC data), then you'd need to leverage streams and tasks instead. First, load the file into a staging table using either a COPY INTO statement or Snowpipe (which requires SQS to be set up), then put a stream over the staging table, and finally create a task that checks the stream for new records and executes a MERGE against your final table.

https://docs.snowflake.com/en/user-guide/data-pipelines.html
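A sketch of the staging-table / stream / task pipeline described above, assuming the CDC files carry an operation flag column (`op_flag` with values like 'I'/'U'/'D'); all object names here are illustrative:

```sql
-- 1. Load the twice-daily files into a staging table
COPY INTO customer_staging
FROM @s3_stage
FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

-- 2. Stream captures rows added to staging since the last consumption
CREATE OR REPLACE STREAM customer_stream ON TABLE customer_staging;

-- 3. Scheduled task merges pending changes into the final table,
--    covering the insert / update / delete cases from the question
CREATE OR REPLACE TASK merge_customer
  WAREHOUSE = my_wh
  SCHEDULE = '720 MINUTE'   -- 2 times/day
WHEN SYSTEM$STREAM_HAS_DATA('CUSTOMER_STREAM')
AS
MERGE INTO customer AS tgt
USING customer_stream AS src
  ON tgt.customer_id = src.customer_id
WHEN MATCHED AND src.op_flag = 'D' THEN DELETE
WHEN MATCHED AND src.op_flag = 'U' THEN
  UPDATE SET tgt.name = src.name, tgt.email = src.email
WHEN NOT MATCHED AND src.op_flag <> 'D' THEN
  INSERT (customer_id, name, email)
  VALUES (src.customer_id, src.name, src.email);

-- Tasks are created suspended; resume to start the schedule
ALTER TASK merge_customer RESUME;
```

Consuming the stream in a DML statement (the MERGE) advances its offset, so each run only sees changes loaded since the previous run.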

Mike Walton
  • If I have a Snowflake DB with 200 tables, can we still apply this technique to the entire DB? I am looking for an ETL solution for change data capture (CDC). Do we need to rely on the ETL functionality provided by Snowflake, or use an ETL tool like Informatica or Talend? – osr May 25 '20 at 15:21
  • 1
    You can do either. Snowflake has the functionality to handle this using tasks and streams, or you can leverage one of the partner ETL tools like Informatica, Talend, Matillion, etc. – Mike Walton May 25 '20 at 15:38