
Long story short, I've been struggling with memory issues (as well as execution time) when using dask / pandas to write a large dataframe via to_sql or to_csv. Can anyone point me in the right direction? (More detail below.)

I appended two large .csv files together into an even larger one (27M rows x 52 columns) with dask rather than pandas in order to save memory, since with pandas I was running out of memory or close to it.

From there, I am simply trying to write it to a .csv, but I again run out of memory when using .compute() to convert the dask dataframe into a pandas dataframe before writing the file. I would like to focus this question on how I can get this thing into a .csv without running out of memory; from there, I plan to use SQL Server's import wizard to load the file into a new table in a database. The code below shows how I read and concatenate as a dask dataframe, convert to a pandas dataframe, and then attempt to write to a .csv before running out of memory:

import dask.dataframe as dd

# Read both files lazily, forcing the phone-number columns to strings
df1 = dd.read_csv('C:/Users/david.erickson/file1.csv', dtype={'Phone #': str, 'Called #': str})
df2 = dd.read_csv('C:/Users/david.erickson/file2.csv', dtype={'Phone #': str, 'Called #': str})
# Stack the two frames and drop duplicate rows
df_combined = dd.concat([df1, df2], axis=0).drop_duplicates()
# Materialize everything into a single pandas DataFrame
df_combined = df_combined.compute()
df_combined.to_csv('C:/Users/david.erickson/file_combined.csv', index=False)

I am running into memory errors at either the .compute() step or the final to_csv, after watching Python approach 38 GB of committed physical memory:

MemoryError: Unable to allocate 210. MiB for an array with shape (27571126,) and data type int64
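
For reference, the dask-native route that skips .compute() entirely would look roughly like the sketch below (untested): calling to_csv on the dask dataframe with a '*' in the filename writes one CSV per partition, so the data never has to exist as a single pandas frame. Whether it avoids the error above is unclear, since drop_duplicates still has to compare rows across partitions.

import dask.dataframe as dd

df1 = dd.read_csv('C:/Users/david.erickson/file1.csv', dtype={'Phone #': str, 'Called #': str})
df2 = dd.read_csv('C:/Users/david.erickson/file2.csv', dtype={'Phone #': str, 'Called #': str})
df_combined = dd.concat([df1, df2], axis=0).drop_duplicates()

# No .compute(): each partition is streamed to its own file,
# e.g. file_combined-0.csv, file_combined-1.csv, ...
df_combined.to_csv('C:/Users/david.erickson/file_combined-*.csv', index=False)

The per-partition files could then be loaded into SQL Server one after another.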

Before anyone mentions it, I have also tirelessly tried to_sql, with no success; the ultimate goal is to get this data into a SQL Server database. It either takes forever or runs into memory errors with sqlalchemy's to_sql. Ideally I would write directly to SQL Server, and this is what I tried:

import sqlalchemy as sa

# cnxn_string is the SQL Server connection string (defined elsewhere)
engine = sa.create_engine(cnxn_string, fast_executemany=True)
connection = engine.connect()
df.to_sql('New_Table', con=engine, if_exists='replace', index=False)

I also tried:

df6.to_sql('New_Table', con=engine, if_exists='replace', index=False, chunksize=40, method='multi')

There is a limit of 2098 parameters per insert, which is why chunksize=40 (40 * 52 columns = 2080, just under the limit). In any case, this was slower than only passing fast_executemany=True to create_engine.

Unfortunately, the attempts above, as well as some other Python/sqlalchemy .to_sql approaches that I researched, either ran out of memory or took so long (overnight) that I had to kill them.
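
For completeness, a partition-by-partition insert straight from the dask dataframe is another variant, so only one pandas-sized piece is in memory at a time. This is just an untested sketch that assumes df_combined is kept as the (uncomputed) dask dataframe and reuses cnxn_string from above:

import sqlalchemy as sa

engine = sa.create_engine(cnxn_string, fast_executemany=True)

# Turn the dask dataframe into a list of delayed pandas frames and
# insert them one at a time ('append' also creates the table if it
# does not exist yet).
for part in df_combined.to_delayed():
    pdf = part.compute()  # a single partition as a pandas DataFrame
    pdf.to_sql('New_Table', con=engine, if_exists='append', index=False)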

Across the 52 columns and 27.5 million rows, I would guess the average string size per cell is ~20 characters (at least for the text columns). There has to be something I can do differently to get this data into a SQL Server database, but I am really struggling; I have never had issues with to_sql or to_csv before this project, which has much more data than anything I've handled previously.

(Processor and memory specs of the machine were attached here.)

David Erickson
  • Unfortunately, I am now 0 for 3. I just tried passing `single_file=True` with this full line of code: `df_combined.to_csv('C:/Users/david.erickson/file_combined.csv', index=False, single_file=True)` and received the same error: `MemoryError: Unable to allocate 210. MiB for an array with shape (27571126,) and data type int64` – David Erickson Jun 12 '20 at 08:37
  • I used a new Virtual Machine from work with 10x the memory, and I was able to combine the files. It ended up taking ~50 GB of memory with the code in the comment above. I am now successfully (hopefully) reading the combined 14.7 GB file into SQL Server with the Import Data wizard. – David Erickson Jun 12 '20 at 10:50

1 Answer


Assuming the CSV has something like a primary key, split the dataset. If you have 52 columns, split it into 4 frames of about 14 columns each (4 x 13 columns, plus a duplicated primary-key column in each frame so you can match them back up), and then pass the smaller datasets to SQL.
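
A rough sketch of that split, assuming a pandas frame df, a primary-key column hypothetically named 'ID', and the engine from the question:

key = 'ID'  # hypothetical name of the primary-key column
other_cols = [c for c in df.columns if c != key]

# Four frames of roughly 13 columns each, every one carrying the key
# so the pieces can be joined back together in SQL afterwards.
for n, i in enumerate(range(0, len(other_cols), 13)):
    part = df[[key] + other_cols[i:i + 13]]
    part.to_sql(f'New_Table_part{n}', con=engine, if_exists='replace', index=False)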

Or make a list of primary keys, insert only the top N, commit the transaction, and start on the next batch.
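
The same batching idea can also be driven straight off the CSV with pandas' chunked reader; a sketch, reusing cnxn_string and the column dtypes from the question (the chunk size is just an example):

import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine(cnxn_string, fast_executemany=True)

# Read 100,000 rows at a time and append each chunk in its own
# to_sql call, so only one chunk is ever held in memory.
reader = pd.read_csv('C:/Users/david.erickson/file_combined.csv',
                     chunksize=100_000,
                     dtype={'Phone #': str, 'Called #': str})
for chunk in reader:
    chunk.to_sql('New_Table', con=engine, if_exists='append', index=False)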

Also, MS SQL has an import function for .csv that tends to be fairly efficient since it batches its inserts. Why use Python as a layer between the two?

Repr
  • I tried importing just one file, but I ran into errors with the format of one of the columns, so I just made the columns varchar(500) in case some rows had a long string (it doesn't tell you which column has the problematic data). That apparently helped, but then I ran into issues with the format of some of the date columns and the import got stuck, and I didn't try again after that. On the topic of importing into a SQL table with SQL Server's tool, I'm not sure how I can import two files into one table, so I was trying to create one big file and then try again. – David Erickson Jun 12 '20 at 08:33
  • You can import them into the same table by doing 2 consecutive import jobs; assuming the column names and order match, this shouldn't be a problem. If they don't match, or just so you keep the source tables before merging, you can also import them into separate tables and then merge the two using regular SQL (use a loop for batching). Do remember to remove all indexes from the target table before doing this: an index is great for searching, terrible for bulk inserts. It's faster to rebuild them after the fact than to insert with them enabled. – Repr Jun 12 '20 at 08:55
  • Thanks for your comments. Per my comment on the question above, I was able to get it working. For the next part of my problem, I might post a separate question, but I was hoping to get some people experienced with python/sqlalchemy/sqlserver to post some insight before accepting any answers. – David Erickson Jun 12 '20 at 10:52