
I have the below table in SQL, which has 86M rows in it:

Transactions 

I am trying to get it into a dataframe with the code below:

data = cs.execute("""
select * from transactions;
""").fetch_pandas_all()

This takes much too long to load.

What is a way I can make this load faster? Is there any method I can use? Should I create the table itself in the SQL statement instead of a select? Any insight would be helpful.

It is interesting because creating this table in SQL takes about 25 seconds, but loading the same data into a dataframe takes about 15 minutes. So I am wondering if there is a way to achieve the same speed as SQL in Python.

Michael Norman

1 Answer


There is a fundamental difference between what you are doing and what you are expecting.

Case 1 - when using Snowflake only, you are doing something like this:

create table mytable as select ... from anothertable;

This is fast because the data movement happens from S3 to S3 inside Snowflake and uses Snowflake optimizations such as micro-partitions.

Case 2 - for Python pandas fetch_pandas_all() - you are reading data from Snowflake and fetching it into the local system (a pandas dataframe) where Python is running. This means all 86M rows are moved over the network, which can take time.

So, here is what I can think of to optimize the Python code:

  1. Instead of fetching everything at once, fetch in batches (see the sketch after this list).
  2. Apply a filter to fetch only the required rows.
  3. Fetch only the required columns instead of all of them.
  4. Run Python close to your Snowflake cluster (e.g. on AWS or GCP) using a machine with more RAM.
  5. If all else fails, you can try a lazily evaluated pandas dataframe, which will not fetch all the data until you do a commit. This is somewhat faster, but at some point it will still take time. Please refer to the link below: How to create lazy_evaluated dataframe columns in Pandas
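
For points 1-3, a minimal sketch, assuming cs is the same Snowflake cursor as in the question; the column names and the date filter are hypothetical placeholders you would replace with your own. fetch_pandas_batches() returns the result set as a sequence of smaller dataframes instead of one 86M-row frame:

import pandas as pd

# Select only the columns you need and filter rows server-side
# (column names and the date filter below are placeholders).
cs.execute("""
    select txn_id, amount, txn_date
    from transactions
    where txn_date >= '2023-01-01'
""")

# fetch_pandas_batches() yields the result set as a series of
# smaller DataFrames instead of materializing all rows at once.
batches = cs.fetch_pandas_batches()

# Concatenate only if you really need one in-memory frame;
# otherwise process each batch inside a loop.
data = pd.concat(batches, ignore_index=True)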
Koushik Roy
  • The issue is, I do need all of the rows and columns and cannot move it closer to my cluster. What would your number 5 mean? I suppose I don't understand how that can be used. I also found some info on creating the dataframe in chunks, but I'm not sure if that is the best way to go: https://docs.snowflake.com/en/user-guide/python-connector-pandas#migrating-to-pandas-dataframes – Michael Norman Feb 22 '23 at 18:54
  • Chunking and looping is something you can do. A `lazy frame` will not hit the database until you do some kind of commit, so creating the frame, selecting from it, etc. won't take time, but writing to a table or any other I/O operation will. – Koushik Roy Feb 23 '23 at 05:54
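
As a rough illustration of the chunk-and-loop approach from the comments (following the Snowflake docs linked above), a sketch where process() is a hypothetical stand-in for your own per-chunk logic, so the full 86M rows never need to sit in memory at once:

cs.execute("select * from transactions")

# Each chunk is an ordinary pandas DataFrame holding only part of
# the result set, so memory use stays bounded.
for chunk in cs.fetch_pandas_batches():
    # process() is a placeholder for whatever per-chunk work you
    # need (aggregating, filtering, writing to a file, ...).
    process(chunk)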