
I'm currently trying to tune the performance of a few of my scripts, and the bottleneck always seems to be the actual insert into the DB (MSSQL) with pandas' to_sql function.

One factor which plays into this is mssql's parameter limit of 2100.

I establish my connection with sqlalchemy (with the mssql + pyodbc flavour):

engine = sqlalchemy.create_engine("mssql+pyodbc:///?odbc_connect=%s" % params, fast_executemany=True)

When inserting I use a chunksize (so I stay below the parameter limit) and method="multi":

dataframe_audit.to_sql(name="Audit", con=connection, if_exists="append",
                       method="multi", chunksize=50, index=False)
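For reference, with method="multi" each row in a chunk contributes one parameter per column, so the largest safe chunksize depends on the column count. A minimal sketch of that arithmetic (the helper name and the one parameter of headroom are my own choices, not anything from pandas):

```python
def max_chunksize(n_cols, param_limit=2100):
    """Largest chunksize keeping n_rows * n_cols below SQL Server's
    2100-parameter limit, with one parameter of headroom."""
    return (param_limit - 1) // n_cols

# e.g. a 10-column DataFrame could use chunks of up to 209 rows
print(max_chunksize(10))
```

So a chunksize of 50 is quite conservative for a narrow table, but any larger value would have to shrink again as soon as columns are added.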

This leads to the following (unfortunately extremely inconsistent) performance: [timing chart omitted]

I'm not sure what to think of this exactly:

  • The inconsistency seems to stem from the DB server itself
  • A greater chunksize does not seem to translate to better performance (if anything, the other way round!?)
  • Maybe I should switch from pyodbc to turbodbc (according to some posts it yields better performance)

Any ideas to get a better insert performance for my DataFrames?

zeitghaist

1 Answer


If you are using the most recent version of pyodbc with ODBC Driver 17 for SQL Server and fast_executemany=True in your SQLAlchemy create_engine call then you should be using method=None (the default) in your to_sql call. That will allow pyodbc to use an ODBC parameter array and give you the best performance under that setup. You will not hit the SQL Server stored procedure limit of 2100 parameters (unless your DataFrame has ~2100 columns). The only limit you would face would be if your Python process does not have sufficient memory available to build the entire parameter array before sending it to the SQL Server.
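Put together, the setup described above would look roughly like this (server, database and driver settings are placeholders you'd adjust; the engine and insert lines are shown as comments since they need a live SQL Server to run):

```python
import urllib.parse

# Hypothetical DSN-less connection string -- adjust server/database/auth.
conn_str = (
    "DRIVER=ODBC Driver 17 for SQL Server;"
    "SERVER=myserver;"
    "DATABASE=mydb;"
    "Trusted_Connection=yes;"
)
params = urllib.parse.quote_plus(conn_str)
url = "mssql+pyodbc:///?odbc_connect=" + params
print(url)

# engine = sqlalchemy.create_engine(url, fast_executemany=True)
# dataframe_audit.to_sql(name="Audit", con=engine, if_exists="append",
#                        index=False)  # method=None (default), no chunksize
```

Note that to_sql here uses the default method=None, letting pyodbc build a single parameter array per executemany call instead of one multi-row INSERT per chunk.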

The method='multi' option for to_sql is only applicable to pyodbc when using an ODBC driver that does not support parameter arrays (e.g., FreeTDS ODBC). In that case fast_executemany=True will not help and may actually cause errors.

Gord Thompson
  • Hi, that is currently my setup (FreeTDS on RHEL). I tried it without the fast_executemany parameter and got roughly the same performance. Would it help in this case if I were using Microsoft's ODBC Driver 17? – zeitghaist Dec 22 '19 at 17:51
  • Yes. FreeTDS ODBC does not support `fast_executemany=True`, but Microsoft's "ODBC Driver 17 for SQL Server" does. – Gord Thompson Dec 22 '19 at 18:00
  • Just as a final comment: I got around to installing the MSSQL driver on CentOS and tested with a 100k-record file with the optimizations mentioned by Gord Thompson --> from roughly 2 minutes down to 10 seconds :) – zeitghaist Jul 06 '20 at 11:24