
I am trying to insert data from a .csv file on my PC into a remote server. The values go into a table with 3 columns, namely Timestamp, Value and TimeseriesID. I have to insert approximately 3000 rows at a time, so I am currently using pyodbc and executemany.

My code up to now is the one shown below:

import contextlib

import pyodbc

with contextlib.closing(pyodbc.connect(connection_string, autocommit=True)) as conn:
    with contextlib.closing(conn.cursor()) as cursor:
        cursor.fast_executemany = True  # new in pyodbc 4.0.19

        # Insert values into the DataTable table
        insert_df = df[["Time (UTC)", column]].copy()
        insert_df["id"] = timeseriesID
        insert_df = insert_df[["id", "Time (UTC)", column]]
        sql = "INSERT INTO %s (%s, %s, %s) VALUES (?, ?, ?)" % (
            sqltbl_datatable, 'TimeseriesId', 'DateTime', 'Value')

        params = [i.tolist() for i in insert_df.values]
        cursor.executemany(sql, params)

As I am using pyodbc 4.0.19, I have set the fast_executemany option to True, which is supposed to speed things up. However, for some reason, I do not see any great improvement when I enable fast_executemany. Is there an alternative way I could use to speed up the insertion of my file?

Moreover, regarding the performance of the code shown above: I noticed that when I disable autocommit=True and instead call cursor.commit() at the end of my code, the data is imported significantly faster. Is there a specific reason for this that I am not aware of?

Any help would be greatly appreciated :)

Nisfa
    Perhaps turbodbc (http://turbodbc.readthedocs.io/en/latest/) could be a better solution. If you're curious you could watch this talk: https://www.youtube.com/watch?v=B-uj8EDcjLY – Ignacio Vergara Kausel Jan 22 '18 at 13:26
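
A minimal sketch of what that turbodbc route could look like for this table; the DSN, credentials and dtypes below are assumptions, not taken from the question:

import numpy as np
from turbodbc import connect

# Sketch only: turbodbc binds whole columns as NumPy arrays instead of
# row-by-row parameter lists, which is where its speed advantage comes from.
connection = connect(dsn="my_remote_dsn", uid="user", pwd="password")  # placeholder DSN/credentials
cursor = connection.cursor()

ids = np.full(len(insert_df), timeseriesID, dtype="int64")
times = insert_df["Time (UTC)"].values.astype("datetime64[us]")
values = insert_df[column].values.astype("float64")

cursor.executemanycolumns(
    "INSERT INTO %s (TimeseriesId, DateTime, Value) VALUES (?, ?, ?)" % sqltbl_datatable,
    [ids, times, values],
)
connection.commit()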

1 Answer


Regarding the cursor.commit() speed-up you are noticing: with autocommit=True, every single INSERT is executed as its own database transaction, which means the code resumes only after the database confirms that row is safely stored on disk. When you instead call cursor.commit() once after the numerous INSERTs, you are effectively executing a single database transaction; in the interim the data can sit in RAM (it may be written to disk incrementally, but it is only made durable when you instruct the database to finalize the transaction).

Finalizing a transaction typically entails updating tables on disk, updating indexes, flushing logs, syncing copies, etc., which is costly. That is why you observe such a speed-up between the two scenarios you describe. When going the faster way, please note that until cursor.commit() has executed you cannot be 100% sure the data is in the database, so you may need to reissue the query in case of an error (any partial transaction is rolled back).
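
As a minimal sketch of that single-transaction pattern, reusing connection_string, sql and params from your question (the explicit rollback is only illustrative):

import contextlib

import pyodbc

# One transaction for the whole batch: commit once at the end.
with contextlib.closing(pyodbc.connect(connection_string, autocommit=False)) as conn:
    with contextlib.closing(conn.cursor()) as cursor:
        cursor.fast_executemany = True
        try:
            cursor.executemany(sql, params)
            conn.commit()    # data becomes durable here
        except pyodbc.Error:
            conn.rollback()  # partial work is discarded; the batch can be re-run
            raise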

sophros