I am trying to write an ETL process to update the data from local machine to the SQL server using Python. The file that I am working on, has 46 columns and around 850K records. I tried without fast_executemany=True first but that took almost 3 hours. I read online and added fast_executemany=True and tested out with 10K record for both. Fast_executemany=True only improved by 10-15 seconds for 10K records. I have read online that people have used this and it took them only 5 minutes to insert almost a million rows.
Below is my python script.
print('Time Starts: ', datetime.now().strftime("%m/%d/%Y %H:%M:%S"))
# Variable names
Server='server'
Database='db'
Driver='SQL Server Native Client 11.0'
Database_Con = 'mssql+pyodbc://@' + Server + '/' + Database + '?trusted_connection=yes&driver=' + Driver
# Creating an engine connection
engine=create_engine(Database_Con, fast_executemany=True).connect()
test.to_sql('TEST123',
engine,
schema='dbo',
chunksize=math.floor(2000/len(test.columns)),
#chunksize=1000,
method='multi',
if_exists='append',
index=False
)
print('Time Ends: ', datetime.now().strftime("%m/%d/%Y %H:%M:%S"))
Is there anything that I'm missing here or is there any faster way to do this. I cannot use BULK INSERT due to access issues.
Edited
I saw an article where it stated that fast_executemany uses multi-inserted as a method by default and in the above section I was also mentioning the chunk size and method='multi' that's why it was not working properly. After I removed the chunksize and method, the time to insert 850K records dropped from 3 hours to around 25 minutes.
Below is my code.
print('Time Starts: ', datetime.now().strftime("%m/%d/%Y %H:%M:%S"))
# Variable names
Server='server'
Database='db'
Driver='SQL Server Native Client 11.0'
Database_Con = 'mssql+pyodbc://@' + Server + '/' + Database + '?trusted_connection=yes&driver=' + Driver
# Creating an engine connection
engine=create_engine(Database_Con, fast_executemany=True).connect()
test.to_sql('TEST123',
engine,
schema='dbo',
if_exists='append',
index=False
)
print('Time Ends: ', datetime.now().strftime("%m/%d/%Y %H:%M:%S"))
Although this is much better than before but I feel like there has to be something more faster. I don't think it should take more than 5 minutes.