
We want to ingest our source tables from an S3 external stage into Snowflake. For this ingestion we have to handle new files arriving in the S3 bucket, updates to existing files, and, in some cases, row deletions.

We are evaluating 3 approaches so far:

  • a full drop & copy of the tables daily (straightforward but less performant)
  • a COPY command that captures new and updated files, followed by a MERGE query to deduplicate and, for specific use cases, delete rows (could work out, but we would need to maintain more complex merge & delete logic per case)
  • external tables on top of the external S3 stage, with materialized views on top of the external tables to boost query performance (we are not sure whether this is a recommended ingestion mechanism; a rough sketch of this option follows the list)
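
To make the third option concrete, here is roughly what we have in mind. This is a minimal sketch only; all object names, columns, and the CSV layout are hypothetical:

    -- Hypothetical names: my_s3_stage, raw.orders_ext, raw.orders_mv.
    -- External table over the stage, auto-refreshed via S3 event
    -- notifications, with a materialized view on top of it.
    CREATE EXTERNAL TABLE raw.orders_ext (
        order_id   NUMBER        AS (VALUE:c1::NUMBER),
        amount     NUMBER(10,2)  AS (VALUE:c2::NUMBER(10,2)),
        updated_at TIMESTAMP_NTZ AS (VALUE:c3::TIMESTAMP_NTZ)
    )
    LOCATION = @my_s3_stage/orders/
    AUTO_REFRESH = TRUE
    FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1);

    -- Materialized view over the external table; maintenance is handled by
    -- Snowflake in the background (the cost of that maintenance is exactly
    -- what we are unsure about).
    CREATE MATERIALIZED VIEW raw.orders_mv AS
    SELECT order_id, amount, updated_at
    FROM raw.orders_ext;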

Of the three approaches, external tables with materialized views seem to keep all tables up to date with the source in the most care-free, low-maintenance way. At the same time, we haven't seen this suggested as an ingestion mechanism for Snowflake, and it is not very clear to us how Snowflake handles the maintenance of materialized views, so we cannot judge the cost implications (could it be compared cost-wise to the copy and merge queries of the second approach, for example?).

In general, are there any best practices for our ingestion use case? Some feedback on the third approach would also be very useful.

  • This is a very involved topic and requires in-depth discussion. You should contact your Snowflake Sales Engineer to get guidance here. If your Sales Engineer needs to call in an expert on performance and/or data pipelines, they can request someone from the Field CTO team to assist. – Greg Pavlik May 18 '22 at 11:43

1 Answer

Try and evaluate the following:

Step 1: Create external tables on top of the external stage.
Step 2: Create a Snowflake stream (standard) on each external table to find the inserts, updates, and deletes on the files.
Step 3: Create a stored procedure with MERGE statements on the target table to apply the inserts, updates, and deletes to the target table.
Step 4: Schedule the stored procedure using a Snowflake task.
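
A minimal sketch of these steps, with hypothetical names (raw.orders_ext as the external table, core.orders as the target, ingest_wh as the warehouse); note that streams on external tables have to be insert-only, so deletes are not captured by the stream itself (see the comments below):

    -- Step 2: stream on the external table (Snowflake requires INSERT_ONLY
    -- for streams on external tables).
    CREATE STREAM orders_ext_stream
      ON EXTERNAL TABLE raw.orders_ext
      INSERT_ONLY = TRUE;

    -- Step 3: stored procedure wrapping a MERGE from the stream into the
    -- target table (delete handling would need additional logic).
    CREATE OR REPLACE PROCEDURE merge_orders()
    RETURNS VARCHAR
    LANGUAGE SQL
    AS
    $$
    BEGIN
      MERGE INTO core.orders t
      USING orders_ext_stream s
        ON t.order_id = s.order_id
      WHEN MATCHED THEN UPDATE SET
        t.amount     = s.amount,
        t.updated_at = s.updated_at
      WHEN NOT MATCHED THEN INSERT (order_id, amount, updated_at)
        VALUES (s.order_id, s.amount, s.updated_at);
      RETURN 'done';
    END;
    $$;

    -- Step 4: task that runs the procedure on a schedule, only when the
    -- stream has captured new data.
    CREATE TASK orders_merge_task
      WAREHOUSE = ingest_wh
      SCHEDULE  = '60 MINUTE'
      WHEN SYSTEM$STREAM_HAS_DATA('ORDERS_EXT_STREAM')
    AS
      CALL merge_orders();

    ALTER TASK orders_merge_task RESUME;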

Note: Also consider source file management (landing into S3, and how to archive the files in S3 after ingestion into Snowflake, etc.).

Raj Gajula
  • Do you think that streams on the external tables might face performance issues, considering that external tables are not as performant as Snowflake internal tables? – Ioannis Agathangelos May 23 '22 at 12:12
  • There could be some performance challenges depending on the regions of the S3 bucket and the Snowflake account. Per Snowflake's recommendation, keeping the source file storage and Snowflake in the same region may reduce network latency (again, no hard and fast rule). – Raj Gajula May 24 '22 at 13:59
  • I started investigating your approach, but I see that only insert-only streams can be set up on top of external tables, so deletes and updates will not be tracked. An alternative could be to use directory tables, but as far as I understand these can only be set at the stage level, not the table level. Any thoughts on this? – Ioannis Agathangelos Jun 09 '22 at 12:53
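
To illustrate the directory-table alternative mentioned in the last comment, a minimal sketch assuming a stage named my_s3_stage; a directory table is enabled at the stage level and tracks file-level metadata (not rows), so row-level updates and deletes would still have to be derived from the file contents:

    -- Enable and refresh the directory table on the (hypothetical) stage.
    ALTER STAGE my_s3_stage SET DIRECTORY = (ENABLE = TRUE);
    ALTER STAGE my_s3_stage REFRESH;

    -- File-level metadata for the stage (path, size, last modified, etc.).
    SELECT relative_path, size, last_modified
    FROM DIRECTORY(@my_s3_stage);

    -- A stream over the directory table, to react to file-level changes.
    CREATE STREAM my_s3_stage_dir_stream ON STAGE my_s3_stage;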