
I am trying to bulk insert a .CSV file into SQL Server without much success.

A bit of background:

1. I needed to insert 16 million records into a SQL Server (2017) DB. Each record has 130 columns. There is a field in the .CSV that comes from an API call to one of our vendors, which I am not allowed to mention. The data types are a mix of integers, floats and strings.

2. I tried the usual: BULK INSERT, but I could not get past the data type errors. I posted a question about it here but could not make it work.

3. I experimented with Python and tried every method I could find except pandas.to_sql, which everybody warned was very slow. I got stuck on data type and string truncation errors, different from the ones BULK INSERT raised (a rough sketch of pinning down column types follows this list).

4. Without many options left I tried pd.to_sql, and while it did not raise any data type or truncation errors, it kept failing because tempdb ran out of space. I could not get past this error either, although I had plenty of space and all my data files (and log files) were set to autogrowth without limit.
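
For reference, here is a minimal sketch of how explicit column types can be pinned down both when reading the .CSV and when writing it out with to_sql, which is what the data type and truncation errors above usually point to. All column names, types, lengths and paths below are placeholders (the real file has 130 columns):

import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.types import Integer, Float, NVARCHAR

engine = create_engine("mssql+pyodbc://@myDSN")

# Force the dtypes pandas uses when reading the CSV (placeholder columns).
df = pd.read_csv('myFile.csv', dtype={'vendor_id': 'Int64', 'amount': 'float64', 'comment': str})

# Tell to_sql which SQL Server types to use if it has to create the table,
# so string columns are not created too narrow and truncated later.
df.to_sql('myTable', engine, schema='dbo', if_exists='append', index=False, chunksize=100,
          dtype={'vendor_id': Integer(), 'amount': Float(), 'comment': NVARCHAR(500)})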

I got stuck at that point. My code (for the pd.to_sql piece) was simple:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mssql+pyodbc://@myDSN")

df.to_sql('myTable', engine, schema='dbo', if_exists='append',index=False,chunksize=100)

I am not really sure what else to try, so any word of advice is welcome. All the code and examples I have seen deal with small datasets (not many columns). I am willing to try any other method and would appreciate any pointers.

Thanks!

Wilmar
  • Stack Overflow is a Question and Answer website. There's nothing wrong with answering your own question; however, don't post the answer and question in the question. This actually lacks a question and is written like a blog, which isn't what Stack Overflow is for. If you *do* want to share a problem you tried to solve along with an answer, that is completely welcome, but you need to do so by posting an answerable question, and then your own answer as an answer. – Thom A Oct 18 '20 at 16:02
  • to comply to StackOverflow's format you could have written both a question and an answer – Walter Tross Oct 18 '20 at 16:02
  • 1
    Alright le me edit this. To make it that way! Thanks! – Wilmar Oct 18 '20 at 16:03
  • See [Can I answer my own question?](https://stackoverflow.com/help/self-answer) and (more importantly) the linked blog [It’s OK to Ask and Answer Your Own Questions](https://stackoverflow.blog/2011/07/01/its-ok-to-ask-and-answer-your-own-questions/). As it stands, this is more likely to be locked and receive downvotes I am afraid, as it doesn't meet the question guidelines for the Stack Exchange network, and Stack Overflow. – Thom A Oct 18 '20 at 16:04
  • Thanks @Larnu. I changed the format: separated question and answer. I can delete the whole post. I just wanted to share. – Wilmar Oct 18 '20 at 16:12
  • Thanks @WalterTross, I just edited the question to comply with the format. – Wilmar Oct 18 '20 at 16:15
  • Answer is separate, great, but you are still lacking a question in your question. :) – Thom A Oct 18 '20 at 16:17
  • How about now @Larnu? – Wilmar Oct 18 '20 at 16:21

3 Answers


I just wanted to share this dirty piece of code in case it helps anybody else. I am very aware it is not optimal at all and that it is slow, but I was able to insert about 16 million records in ten minutes without overloading my machine.

I tried doing it in small batches with:

import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("mssql+pyodbc://@myDSN")

# df holds the rows read from the .CSV; insert them 1,000 rows at a time
a = 0
b = 1000

while a < len(df):
    try:
        df.iloc[a:b].to_sql('myTable', engine, schema='dbo', if_exists='append', index=False, chunksize=100)
    except Exception as e:
        # keep going, but report which batch failed instead of stopping
        print(f'Error between {a} and {b}: {e}')
    a = b
    b = b + 1000

Ugly as hell but worked for me.

I am open to all criticism and advice. As I mentioned, I am posting this in case it helps anybody else, but I am also looking forward to receiving some constructive feedback.

Wilmar

Loading data from a pandas data frame into a SQL database is very slow, and running out of memory is a common problem when dealing with large datasets. You want something much more efficient than that for datasets of this size.

d6tstack is something that might solve your problems, because it works with native DB import commands. It is a library built specifically to deal with schema as well as performance issues. It works with XLS, CSV and TXT inputs, which can be exported to CSV, Parquet, SQL and pandas.
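
As a rough, untested sketch of the pattern (based on how I remember the d6tstack README, so treat the function names, connection strings and table name as assumptions rather than a verified recipe):

import pandas as pd
import d6tstack.utils

df = pd.read_csv('myFile.csv')  # placeholder path

# Documented helper for PostgreSQL: pushes the whole frame through the
# database's native bulk loader instead of row-by-row INSERTs.
d6tstack.utils.pd_to_psql(df, 'postgresql+psycopg2://usr:pwd@localhost/db', 'myTable', if_exists='replace')

# A SQL Server counterpart existed as well, but (per the comment below)
# it was still experimental, so its exact name/signature may have changed.
d6tstack.utils.pd_to_mssql(df, 'mssql+pymssql://usr:pwd@localhost/db', 'myTable')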

  • Thank you @Dashrath. It seems promising. I will try it out. – Wilmar Oct 19 '20 at 18:01
  • 1
    @Wilmar, you might need to do some extra brainstorming. Reason for that is, last time I used this library, it was stable for PostgreSQL and MySQL, but was in experimental state for MSSQL, but its been a while now, I am sure that the community must have fixed that by now. Good luck with this. If stuck anywhere feel free to drop a comment over here, not too good with this things but will surely try to help. – Dashrath Chauhan Oct 20 '20 at 04:58

I think df.to_sql is pretty awesome! I have been using it a lot lately. It's a bit slow when data sets are really huge, though. If you need speed, I think BULK INSERT will be the fastest option. You can even do the job in batches, so you don't run out of memory and perhaps overwhelm your machine.

BEGIN TRANSACTION
BEGIN TRY
    BULK INSERT OurTable
    FROM 'c:\OurTable.txt'
    WITH (CODEPAGE = 'RAW', DATAFILETYPE = 'char', FIELDTERMINATOR = '\t',
          ROWS_PER_BATCH = 10000, TABLOCK)
    COMMIT TRANSACTION
END TRY
BEGIN CATCH
    ROLLBACK TRANSACTION
END CATCH
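
Since the rest of this thread is in Python, here is a rough sketch of firing that same kind of statement through pyodbc; the DSN, table and file path reuse the placeholders from the question, and the FIRSTROW / FIELDTERMINATOR / ROWTERMINATOR values are assumptions about the CSV layout, not something taken from the original post:

import pyodbc

# connect through the same DSN used in the question (placeholder name)
conn = pyodbc.connect('DSN=myDSN', autocommit=False)
cursor = conn.cursor()

# note: the file path must be visible to the SQL Server service itself,
# not just to the machine running this script
sql = r"""
BULK INSERT dbo.myTable
FROM 'C:\myFile.csv'
WITH (FIRSTROW = 2, FIELDTERMINATOR = ',', ROWTERMINATOR = '\n',
      ROWS_PER_BATCH = 10000, TABLOCK);
"""

try:
    cursor.execute(sql)
    conn.commit()
except pyodbc.Error as e:
    conn.rollback()
    print(f'BULK INSERT failed: {e}')
finally:
    conn.close()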
ASH