Long story short, I've been struggling with memory issues (as well as execution time) using dask / pandas to write a large dataframe with to_sql or to_csv. Can anyone point me in the right direction? More detail below.
I appended two large .csv files together to form an even larger .csv file (27M rows x 52 columns) with dask in order to save memory, instead of using pandas, where I was running out of memory or close to it. From there, I am simply trying to write to a .csv, but I am again running out of memory when using .compute() to convert the dask dataframe to a pandas dataframe and write the file. I would like to focus this question on how I can simply get this thing into a .csv without running out of memory. From there, I plan to use SQL Server's import wizard to import the file into a new table in a database. See the code below for how I am reading and concat'ing as a dask dataframe, converting to a pandas dataframe, and then attempting to write to a .csv before running out of memory:
import dask.dataframe as dd

# Read both files lazily, forcing the phone-number columns to strings
df1 = dd.read_csv('C:/Users/david.erickson/file1.csv', dtype={'Phone #': str, 'Called #': str})
df2 = dd.read_csv('C:/Users/david.erickson/file2.csv', dtype={'Phone #': str, 'Called #': str})
# Stack the two dataframes and drop duplicate rows
df_combined = dd.concat([df1, df2], axis=0).drop_duplicates()
# Materialize everything as a single pandas dataframe, then write it out
df_combined = df_combined.compute()
df_combined.to_csv('C:/Users/david.erickson/file_combined.csv', index=False)
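If it helps frame the question, below is a minimal sketch of the dask-native alternative I understand exists, where dask writes the CSV itself instead of materializing everything with .compute() first. I have not confirmed whether this actually stays within memory for my data, which is part of what I am asking:

import dask.dataframe as dd

df1 = dd.read_csv('C:/Users/david.erickson/file1.csv', dtype={'Phone #': str, 'Called #': str})
df2 = dd.read_csv('C:/Users/david.erickson/file2.csv', dtype={'Phone #': str, 'Called #': str})
df_combined = dd.concat([df1, df2], axis=0).drop_duplicates()

# Let dask stream the result to disk partition by partition instead of
# building one giant pandas dataframe in memory first.
# single_file=True merges the partitions into one CSV on disk; by default
# dask writes one file per partition using a '*' pattern in the filename.
# Note: drop_duplicates() across partitions still requires a shuffle,
# which may itself be memory hungry.
df_combined.to_csv('C:/Users/david.erickson/file_combined.csv', single_file=True, index=False)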
I am running into memory errors at either the .compute() step or the .to_csv() step, after watching Python approach 38 GB of committed physical memory:
MemoryError: Unable to allocate 210. MiB for an array with shape (27571126,) and data type int64
Before anyone mentions it, I have also tirelessly tried to_sql, with no success, as the ultimate goal is to get this data into a SQL Server database. It keeps taking forever, and I am running into memory errors with sqlalchemy's to_sql. Even better, I wish I could write directly to SQL Server, and this is what I tried:
import sqlalchemy as sa

# cnxn_string is my SQL Server connection string; df holds the combined data
engine = sa.create_engine(cnxn_string, fast_executemany=True)
connection = engine.connect()
df.to_sql('New_Table', con=engine, if_exists='replace', index=False)
I also tried:

df6.to_sql('New_Table', con=engine, if_exists='replace', index=False, chunksize=40, method='multi')

There is a limit of 2098 parameters per statement, which is why chunksize=40 (40 rows * 52 columns = 2080 parameters, just under the limit). Anyway, this was slower than only passing fast_executemany=True to create_engine.
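To make that chunk-size arithmetic explicit, here is a tiny sketch of the same calculation (the 2098 figure is the limit I mentioned above; the variable names are just for illustration):

# Parameter limit I am working around, and my column count
max_params = 2098
n_cols = 52

# Largest chunksize whose rows * columns stays under the parameter limit
chunksize = max_params // n_cols           # 2098 // 52 = 40
print(chunksize, chunksize * n_cols)       # 40 rows -> 2080 parameters per INSERT

# df6.to_sql('New_Table', con=engine, if_exists='replace', index=False,
#            chunksize=chunksize, method='multi')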
Unfortunately, the attempts mentioned above, as well as some other python sqlalchemy .to_sql approaches that I researched, either ran out of memory or took forever (overnight), and I had to kill them.
Of the 52 columns and 27.5 million rows, I would guess that the average string size per cell is ~20 characters (at least for the text columns). There has to be something I can do differently to get this data into a SQL Server database, but I am really struggling; I have never had issues with to_sql or to_csv before this project, even with much more data.
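For scale, a rough back-of-envelope on those guesses, treating every cell as a ~20-character Python string (the ~50 bytes of per-string object overhead is an assumption about pandas object columns, not something I have measured):

rows, cols = 27_500_000, 52
avg_chars = 20                              # my rough guess per text cell

raw_text_gb = rows * cols * avg_chars / 1e9
# assume ~50 bytes of CPython str overhead on top of the characters
in_memory_gb = rows * cols * (avg_chars + 50) / 1e9

print(f"~{raw_text_gb:.0f} GB of raw text, ~{in_memory_gb:.0f} GB as pandas object columns")
# ~29 GB of raw text, ~100 GB as pandas object columns

which would explain why .compute() blows well past the 38 GB I watched it reach.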