
I have a file which has 3 million records, and I am using the Snowpipe auto-ingest feature to load it automatically.

I wanted to know the behavior of Snowpipe: will it load the 3 million records from the stage in one shot and then commit, or will it be some sort of incremental load, like 10k, 20k, ... 3 million?

rjain550

3 Answers


Since Snowpipe runs a COPY command under the hood, each file is a single transaction. If there is any data issue, the copy follows your file format and other copy options to determine whether a partial load is allowed or not.
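For reference, that behavior comes from the copy options in the pipe definition itself. A minimal sketch (the pipe, table, and stage names here are placeholders, not from the question):

    -- ON_ERROR decides whether a file with bad rows is partially loaded or skipped.
    -- Snowpipe's default is SKIP_FILE; CONTINUE loads the good rows and skips the bad ones.
    CREATE OR REPLACE PIPE my_pipe
      AUTO_INGEST = TRUE
    AS
      COPY INTO my_table
      FROM @my_stage
      FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
      ON_ERROR = 'CONTINUE';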

Two factors are essential to ingesting data via Snowpipe faster:

  1. File size: smaller files are better (ideally under ~250 MB); the bigger the file, the slower the response and the higher the chance of failure
  2. File format: in my experience, CSV works better

The data latency will be around ~15 to 30 seconds. I have simulated this, and it works very well with 50-100 MB files, loading within ~20 seconds.

Alternatively, if the file size is big, then use an external table with auto-refresh, have a task associated with it, and load the data via a copy/insert, as sketched below. But a task's minimum schedule frequency is 1 minute, so your latency is always 1+ minute.
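A rough sketch of that alternative, assuming placeholder names (my_stage, my_wh, my_table) and a two-column CSV layout, with an INSERT ... SELECT standing in for the copy step:

    -- External table whose metadata refreshes on cloud storage notifications
    CREATE OR REPLACE EXTERNAL TABLE my_ext_table
      WITH LOCATION = @my_stage/
      FILE_FORMAT = (TYPE = 'CSV')
      AUTO_REFRESH = TRUE;

    -- Task that loads from the external table; 1 minute is the minimum schedule
    CREATE OR REPLACE TASK load_from_ext
      WAREHOUSE = my_wh
      SCHEDULE = '1 MINUTE'
    AS
      INSERT INTO my_table (col1, col2)
      SELECT value:c1::STRING, value:c2::NUMBER   -- CSV columns surface as c1, c2, ... in VALUE
      FROM my_ext_table;

    ALTER TASK load_from_ext RESUME;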

H Roy

With Snowpipe, loads are combined or split into a single transaction or multiple transactions based on the number and size of the rows in each data file. This is different from COPY, where a load is always performed in a single transaction (all or nothing). So with Snowpipe it is possible that you start seeing some data before the entire file is loaded. Therefore, do not rely on file-granularity transactions with Snowpipe. Files are a chunking mechanism for continuous data loading; a file itself may be loaded in multiple chunks.

Snowpipe is designed to load new data typically within 1 minute after a file notification is sent, but loading can take longer for really large files.
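If you want to observe this behavior yourself, Snowflake exposes per-file load results and the pipe's backlog. A small monitoring sketch (the table and pipe names are placeholders):

    -- Per-file load results for the target table over the last hour
    SELECT file_name, row_count, row_parsed, status, last_load_time
    FROM TABLE(information_schema.copy_history(
      TABLE_NAME => 'MY_TABLE',
      START_TIME => DATEADD(hour, -1, CURRENT_TIMESTAMP())));

    -- JSON summary of the pipe's state, including the pending file count
    SELECT SYSTEM$PIPE_STATUS('my_db.my_schema.my_pipe');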

The most efficient and cost-effective load with Snowpipe is achieved when files are around 100-250 MB in size. If it takes longer than 1 minute to accumulate that many MBs of data in your application, consider aggregating the data to create a new data file within the 100-250 MB range. This leads to a good balance between cost and performance (load latency).

Ref: https://docs.snowflake.com/en/user-guide/data-load-considerations-prepare.html#continuous-data-loads-i-e-snowpipe-and-file-sizing

Asim Rout

As per the recommendations, the file size should be in the range of 100-250 MB compressed for data loading. Loading very large files (e.g. 100 GB or larger) is not recommended.

Snowpipe works in parallel mode; it is used to load continuous streaming data in micro-batches in near real time (as soon as a file lands in S3, Azure Blob Storage, etc.). The number of data files that can be processed in parallel is determined by the number and capacity of the servers/nodes in the warehouse.
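As a concrete illustration of that continuous micro-batch setup, here is a minimal sketch; the bucket path, storage integration, and object names are assumptions, not from the question:

    -- Stage over the landing path; the storage integration is assumed to exist already
    CREATE OR REPLACE STAGE my_s3_stage
      URL = 's3://my-bucket/landing/'
      STORAGE_INTEGRATION = my_s3_int
      FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1);

    -- Auto-ingest pipe; SHOW PIPES returns the notification_channel (SQS ARN)
    -- that the bucket's event notifications must point at
    CREATE OR REPLACE PIPE my_auto_pipe
      AUTO_INGEST = TRUE
    AS
      COPY INTO my_table
      FROM @my_s3_stage;

    SHOW PIPES LIKE 'my_auto_pipe';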

For ad-hoc or one-time loads and queries, you can use the COPY INTO command.

Loading a single huge file via Snowpipe is not recommended.

If you try to ingest a huge single file with 3 million rows, Snowpipe will not be able to use parallel mode. Even a large warehouse will not help boost performance, because the number of load operations that run in parallel cannot exceed the number of data files to be loaded. So a single-file load will use a single node of the warehouse, and the rest of the nodes will not be used.

So if you want to use Snowpipe auto-ingest, please split the large file into smaller files (100-250 MB each). Splitting larger data files allows the load to scale linearly.

Please refer to these links for more details

https://docs.snowflake.com/en/user-guide/data-load-considerations-prepare.html#general-file-sizing-recommendations

https://docs.snowflake.com/en/user-guide/data-load-considerations-prepare.html