
copy into @elasticsearch/product/s3file
from (select object_construct(*) from mytable)
file_format = (type = json compression = none)
overwrite = true
single = false
max_file_size = 5368709120;

The table has 2GB of data. I want to split it into 100MB files stored in S3, but the unload produces files of uneven sizes. The expected result is multiple files of roughly 100MB each.

I need to improve indexing performance in Elasticsearch, and I'm using smart_open with multiprocessing, so evenly sized files would be convenient to work with. Thanks

Sundar

2 Answers


You would only get identical file sizes if every value in each column was exactly the same size.

For example, if your table had firstname and lastname columns, and one record had the values "John" and "Smith" while another record had "Michael" and "Gardner", then, if each record was written to a different file, the resulting JSON files would be different sizes, as "John" is a different size to "Michael" and "Smith" is a different size to "Gardner".

You can also only control the maximum size of a file, not the actual file size. If you had written 10 records to a file and that resulted in a file size of 950MB, and the next record would be 100MB in size, then that record would be written to a new file and the original file would remain at 950MB.
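
As an illustration only (a sketch reusing the stage and table names from the question), setting MAX_FILE_SIZE to 104857600 caps each output file at roughly 100MB, but individual files can still come out smaller than that:

copy into @elasticsearch/product/s3file
from (select object_construct(*) from mytable)
file_format = (type = json compression = none)
overwrite = true
single = false
max_file_size = 104857600; -- ~100MB upper bound per file, not a guaranteed size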

NickW

It's not S3 that splits the files; it's the Snowflake warehouse size that determines the split when you use SINGLE = FALSE in the COPY command. As the warehouse size grows, the number of files increases.

For example, suppose you run your query on an XS warehouse and it produces 8 files on S3; if you use an M warehouse it will create 16 files on S3. The split happens in parallel, so the size may vary for each file. It does not fill a file up to the max limit you gave in the COPY command and then start another file.
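
A rough sketch of that behaviour (the warehouse name mywh is hypothetical): with SINGLE = FALSE each parallel thread writes its own files, so running the same unload on a larger warehouse typically produces more, smaller files.

alter warehouse mywh set warehouse_size = 'XSMALL';
copy into @elasticsearch/product/s3file
from (select object_construct(*) from mytable)
file_format = (type = json compression = none)
overwrite = true
single = false
max_file_size = 104857600; -- per-file cap; each parallel thread writes its own files

alter warehouse mywh set warehouse_size = 'MEDIUM';
-- re-running the same copy now typically produces more files, each smaller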