While researching how to ingest data from an external storage location into Snowflake when de-duplication is necessary, I came across two approaches:
Option 1:
- Create a Snowpipe for the storage location (Azure container or S3 bucket) that is automatically triggered by event notifications (Azure Event Grid and queues, or AWS SQS) and copies data into a staging table in Snowflake
- Create a Stream on this staging table to capture change data
- Periodically run a task that consumes the Stream data and merges (upserts) it into the destination table based on the primary key (see the sketch after this list)
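For concreteness, here is roughly what I have in mind for Option 1. All object names (`my_ext_stage`, `staging_raw`, `target`, `my_wh`) are placeholders, and I'm assuming JSON files carrying an `id` key and an `updated_at` field, with a destination table `target(id NUMBER, record VARIANT)` already in place:

```sql
-- Staging table: a single VARIANT column keeps the COPY simple
CREATE OR REPLACE TABLE staging_raw (record VARIANT);

-- Snowpipe with AUTO_INGEST so event notifications trigger loads
CREATE OR REPLACE PIPE staging_raw_pipe AUTO_INGEST = TRUE AS
  COPY INTO staging_raw
  FROM @my_ext_stage                 -- external stage over the container/bucket
  FILE_FORMAT = (TYPE = 'JSON');

-- Stream captures rows appended by the pipe
CREATE OR REPLACE STREAM staging_raw_stream ON TABLE staging_raw;

-- Task runs on a schedule but skips runs when the stream is empty
CREATE OR REPLACE TASK upsert_target
  WAREHOUSE = my_wh
  SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('staging_raw_stream')
AS
  MERGE INTO target t
  USING (
    -- de-dupe within the batch so MERGE sees one row per key
    SELECT record:id::NUMBER AS id, record
    FROM staging_raw_stream
    QUALIFY ROW_NUMBER()
      OVER (PARTITION BY record:id ORDER BY record:updated_at DESC) = 1
  ) s
  ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET t.record = s.record
  WHEN NOT MATCHED THEN INSERT (id, record) VALUES (s.id, s.record);

ALTER TASK upsert_target RESUME;
```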
Option 2:
- Create an external table with automatic refresh through event notifications (Azure Event Grid and queues, or AWS SQS)
- Create a Stream on this external table to capture change data
- Periodically run a task that consumes the Stream data and merges (upserts) it into the destination table based on the primary key (sketched below)
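And roughly what I have in mind for Option 2, using the same placeholder names as above (note that streams on external tables must be created as insert-only):

```sql
-- External table with AUTO_REFRESH driven by event notifications;
-- the implicit VALUE pseudo-column exposes each record as VARIANT
CREATE OR REPLACE EXTERNAL TABLE ext_raw
  LOCATION = @my_ext_stage
  AUTO_REFRESH = TRUE
  FILE_FORMAT = (TYPE = 'JSON');

-- Streams on external tables only support insert-only mode
CREATE OR REPLACE STREAM ext_raw_stream
  ON EXTERNAL TABLE ext_raw INSERT_ONLY = TRUE;

-- Same shape of task as in Option 1, except it reads the external
-- table's VALUE pseudo-column, so each run scans files in cloud storage
CREATE OR REPLACE TASK upsert_target_ext
  WAREHOUSE = my_wh
  SCHEDULE = '5 MINUTE'
WHEN SYSTEM$STREAM_HAS_DATA('ext_raw_stream')
AS
  MERGE INTO target t
  USING (
    SELECT value:id::NUMBER AS id, value AS record
    FROM ext_raw_stream
    QUALIFY ROW_NUMBER()
      OVER (PARTITION BY value:id ORDER BY value:updated_at DESC) = 1
  ) s
  ON t.id = s.id
  WHEN MATCHED THEN UPDATE SET t.record = s.record
  WHEN NOT MATCHED THEN INSERT (id, record) VALUES (s.id, s.record);

ALTER TASK upsert_target_ext RESUME;
```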
I believe that if the MERGE statement weren't necessary to enforce the primary key and remove duplicates, Snowpipe would be the clear winner, because it copies new data directly into a table in one step. However, since staging and merging the data is necessary, which option is better?
Thank you!