I'm currently trying to tune the performance of a few of my scripts, and the bottleneck is consistently the actual insert into the DB (MSSQL) via pandas' to_sql function.
One factor that plays into this is MSSQL's limit of 2100 parameters per query.
I establish my connection with sqlalchemy (using the mssql+pyodbc dialect):

```python
import sqlalchemy

# params is my URL-encoded ODBC connection string (built elsewhere)
engine = sqlalchemy.create_engine(
    "mssql+pyodbc:///?odbc_connect=%s" % params,
    fast_executemany=True,
)
```
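For completeness, params is just the URL-encoded ODBC connection string; a minimal sketch of how it can be built (driver, server and credentials here are placeholders, not my actual values):

```python
import urllib.parse

# Hypothetical connection details; substitute your own driver/server/credentials.
params = urllib.parse.quote_plus(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=myserver;"
    "DATABASE=mydb;"
    "UID=user;"
    "PWD=secret"
)
```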
When inserting, I use a chunksize (to stay below the parameter limit) together with method="multi":
```python
dataframe_audit.to_sql(
    name="Audit",
    con=connection,
    if_exists="append",
    method="multi",
    chunksize=50,
    index=False,
)
```
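Since with method="multi" every cell of a chunk becomes one bind parameter, chunksize * n_columns has to stay below 2100. A minimal sketch of deriving the chunksize from the DataFrame's column count instead of hard-coding 50:

```python
# Each row in a multi-row INSERT consumes one parameter per column,
# so the chunk size must satisfy chunksize * n_columns < 2100.
n_columns = len(dataframe_audit.columns)
safe_chunksize = max(1, 2100 // n_columns - 1)  # minus one row for headroom

dataframe_audit.to_sql(
    name="Audit",
    con=connection,
    if_exists="append",
    method="multi",
    chunksize=safe_chunksize,
    index=False,
)
```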
The resulting insert performance is unfortunately extremely inconsistent.
I'm not sure what to think of this exactly:
- The inconsistency seems to stem from the DB server itself
- A larger chunksize does not seem to translate into better performance; if anything, it's the other way round!? (see the timing sketch after this list)
- Maybe I should switch from pyodbc to turbodbc (according to some posts, it yields better insert performance)
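For the chunksize point above, something like this timing loop is how I compare runs (a minimal sketch; dataframe_audit and connection are as above, and note that each run appends the rows again):

```python
import time

# Compare insert duration for a few chunksizes, kept small enough that
# chunksize * n_columns stays below the 2100-parameter limit.
for chunksize in (10, 50, 100):
    start = time.perf_counter()
    dataframe_audit.to_sql(
        name="Audit",
        con=connection,
        if_exists="append",
        method="multi",
        chunksize=chunksize,
        index=False,
    )
    print(f"chunksize={chunksize}: {time.perf_counter() - start:.2f}s")
```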
Any ideas on how to get better insert performance for my DataFrames?