
I am trying to write an ETL process in Python that loads data from my local machine into SQL Server. The file I am working with has 46 columns and around 850K records. I first tried without fast_executemany=True, but that took almost 3 hours. I then read about fast_executemany=True online, added it, and tested both versions with 10K records; fast_executemany=True only improved the time by 10-15 seconds for those 10K records. I have read online that people have used this and inserted almost a million rows in only 5 minutes.

Below is my Python script.

from datetime import datetime
import math

from sqlalchemy import create_engine

print('Time Starts: ', datetime.now().strftime("%m/%d/%Y %H:%M:%S"))


# Connection details
Server = 'server'
Database = 'db'
Driver = 'SQL Server Native Client 11.0'
Database_Con = 'mssql+pyodbc://@' + Server + '/' + Database + '?trusted_connection=yes&driver=' + Driver


# Create the engine and open a connection
engine = create_engine(Database_Con, fast_executemany=True).connect()


# test is the pandas DataFrame (46 columns, ~850K rows) loaded earlier
test.to_sql('TEST123',
            engine,
            schema='dbo',
            chunksize=math.floor(2000 / len(test.columns)),  # keep each multi-row INSERT under SQL Server's 2100-parameter limit
            #chunksize=1000,
            method='multi',
            if_exists='append',
            index=False
            )

print('Time Ends: ', datetime.now().strftime("%m/%d/%Y %H:%M:%S"))

Is there anything that I'm missing here, or is there a faster way to do this? I cannot use BULK INSERT due to access issues.

Edited

I saw an article stating that fast_executemany already handles batching the inserts on its own, so specifying chunksize and method='multi' in the code above was what kept it from working properly. After I removed chunksize and method, the time to insert 850K records dropped from 3 hours to around 25 minutes.

Below is my code.

from datetime import datetime

from sqlalchemy import create_engine

print('Time Starts: ', datetime.now().strftime("%m/%d/%Y %H:%M:%S"))


# Connection details
Server = 'server'
Database = 'db'
Driver = 'SQL Server Native Client 11.0'
Database_Con = 'mssql+pyodbc://@' + Server + '/' + Database + '?trusted_connection=yes&driver=' + Driver


# Create the engine and open a connection
engine = create_engine(Database_Con, fast_executemany=True).connect()


# No chunksize or method='multi' here: the default to_sql path uses executemany,
# which is what fast_executemany accelerates
test.to_sql('TEST123',
            engine,
            schema='dbo',
            if_exists='append',
            index=False
            )

print('Time Ends: ', datetime.now().strftime("%m/%d/%Y %H:%M:%S"))
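For anyone checking this later: one way to confirm that to_sql is really hitting the executemany path that fast_executemany speeds up is to hook SQLAlchemy's cursor events. This is only a sketch, reusing the Database_Con string and the test DataFrame from above; it was not part of my timed runs.

from sqlalchemy import create_engine, event

engine = create_engine(Database_Con, fast_executemany=True)

@event.listens_for(engine, "before_cursor_execute")
def log_executemany(conn, cursor, statement, parameters, context, executemany):
    # With the default method, the INSERT issued by to_sql goes through
    # executemany (prints True); with method='multi', pandas builds one big
    # multi-row INSERT, this prints False, and fast_executemany never applies.
    # (Metadata queries pass through here too and will print False.)
    print("executemany:", executemany)

test.to_sql('TEST123', engine, schema='dbo', if_exists='append', index=False)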

Although this is much better than before, I feel like there has to be something faster. I don't think it should take more than 5 minutes.

hkay
  • Are you communicating with the target database over a network? If so, is it a LAN connection or a WAN/cloud connection? – Gord Thompson Jun 04 '22 at 12:32
  • I am working from home and connecting to the company's network to be able to connect to the database. I believe I have LAN but I'm not sure. – hkay Jun 04 '22 at 12:46
  • That would be an Internet (WAN) connection, even if you are connecting over a VPN to make it appear that you are on the company network. Internet connections can be slow, and using a VPN can make them even slower. Granted, 567 rows/sec (25 minutes) is rather pokey, but it's better than 79 rows/sec (3 hours). – Gord Thompson Jun 04 '22 at 13:16
  • Gotcha. Do you think that's the best it can get especially if I'm connecting over a VPN? – hkay Jun 04 '22 at 15:02
  • Actually, your VPN connection doesn't seem so bad. When I push 100_000 rows x 45 columns over my LAN via WiFi (~72 Mbps) I only move 700 rows/sec. But then again the machines I am using are fairly ancient (10 year-old client with 8 GB RAM and 13 year-old server with 4 GB RAM). – Gord Thompson Jun 04 '22 at 15:51
  • You're right. I tested it out again this morning and it took 18 minutes to be specific. I will post my script as an answer along with your comment for the future if someone needs an answer/explanation. Thanks @GordThompson – hkay Jun 04 '22 at 16:13
  • Out of curiosity I tried the JSON trick described [here](https://github.com/mkleehammer/pyodbc/wiki/Tips-and-Tricks-by-Database-Platform#passing-row-oriented-parameter-data-as-a-json-string) but in this case it was significantly slower than `.to_sql()` with `fast_executemany=True` (~490 rows/sec vs. the previous 700 rows/sec). The JSON approach *can* be faster, but perhaps not for tables with more than just a few columns. – Gord Thompson Jun 04 '22 at 18:00 [a rough sketch of that approach follows after these comments]
  • I think the problem is the number of columns. – hkay Jun 05 '22 at 12:22
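For reference, the JSON trick linked in Gord Thompson's comment works roughly along these lines. This is only a sketch, not the exact code from the pyodbc wiki: the connection string, table name, and the two example columns are placeholders, and OPENJSON requires SQL Server 2016 or later.

import pyodbc

# Placeholder DSN-less connection; adjust server/database/driver as needed.
cnxn = pyodbc.connect(
    "DRIVER={SQL Server Native Client 11.0};SERVER=server;DATABASE=db;Trusted_Connection=yes;"
)
crsr = cnxn.cursor()

# Serialize the whole DataFrame as one JSON string ...
json_rows = test.to_json(orient="records", date_format="iso")

# ... and let SQL Server shred it server-side with OPENJSON. col1/col2 stand in
# for the real 46 columns.
crsr.execute(
    """
    INSERT INTO dbo.TEST123 (col1, col2)
    SELECT col1, col2
    FROM OPENJSON(?) WITH (
        col1 int          '$.col1',
        col2 nvarchar(50) '$.col2'
    );
    """,
    json_rows,
)
cnxn.commit()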

1 Answer


I am posting my findings here as an answer. I had to remove chunksize and method from test.to_sql because, based on the article I read, fast_executemany already handles batching the inserts on its own. That reduced the load time from 3 hours to roughly 20 minutes.

To Gord Thompson's point, since I am connecting to the company's network over a VPN, the load may be comparatively slower, but 20 minutes is still not bad.

from datetime import datetime

from sqlalchemy import create_engine

print('Time Starts: ', datetime.now().strftime("%m/%d/%Y %H:%M:%S"))

# Connection details
Server = 'server'
Database = 'db'
Driver = 'SQL Server Native Client 11.0'
Database_Con = 'mssql+pyodbc://@' + Server + '/' + Database + '?trusted_connection=yes&driver=' + Driver


# Create the engine and open a connection
engine = create_engine(Database_Con, fast_executemany=True).connect()


# test is the pandas DataFrame (46 columns, ~850K rows); no chunksize or
# method='multi', so to_sql uses executemany and fast_executemany applies
test.to_sql('TEST123',
            engine,
            schema='dbo',
            if_exists='append',
            index=False
            )

print('Time Ends: ', datetime.now().strftime("%m/%d/%Y %H:%M:%S"))
hkay
  • If you're interested you could try [this](https://gist.github.com/gordthompson/1fb0f1c3f5edbf6192e596de8350f205) and see if it performs any better in your particular case. – Gord Thompson Jun 05 '22 at 12:51
  • I am going try this as well. I liked the way you handled the types there too. – hkay Jul 09 '22 at 15:09
  • So you are saying that removing chunksize and method='multi' improved the performance of your query? I have the same problem: uploading a big table to SQL Server through a VPN, and it's super slow. – bellotto Apr 13 '23 at 23:13