
I have a loop that goes through each file (xlsx and csv) in a directory, reads them, appends them to a list, and concatenates that list into a single DataFrame. The result is then sent to a database. It works, but when it comes to bigger files it can take a lot of time. I figure this is the part where I need to improve my code:

        xls.to_sql(table, con=engine, if_exists='append', index=False)

I've been testing different chunksizes, but how can I choose the optimal chunksize when the files come in all kinds of sizes?

I've also been studying parallelization and was wondering if anyone could help me figure out how to improve this.
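
For context, the loop looks roughly like this (the directory, connection string, and table name below are placeholders; the real code differs):

    import glob
    import os

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholders -- the real directory, connection string and table name differ.
    directory = "/path/to/files"
    engine = create_engine("postgresql://user:password@host:5432/mydb")
    table = "my_table"

    # Read every xlsx/csv in the directory into a list of DataFrames.
    frames = []
    for path in glob.glob(os.path.join(directory, "*")):
        if path.endswith(".xlsx"):
            frames.append(pd.read_excel(path))
        elif path.endswith(".csv"):
            frames.append(pd.read_csv(path))

    # Concatenate everything and push it to the database in one go.
    xls = pd.concat(frames, ignore_index=True)
    xls.to_sql(table, con=engine, if_exists='append', index=False)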

EDIT: By the way, I've also followed this blog post to improve my connection to the database, and I've noticed a lot of performance improvements. But to_sql performance is still lacking.

https://towardsdatascience.com/dramatically-improve-your-database-inserts-with-a-simple-upgrade-6dfa672f1424

Catarina Ribeiro

1 Answer


This is what we did in our project:

chunk_size = 2100 // len(df.columns) - 1

We also used method = 'multi' in to_sql. This has worked out pretty well for us: method = 'multi' speeds things up quite a bit because it packs multiple rows into a single INSERT statement instead of issuing one INSERT per row.
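
Put together, a minimal sketch looks like this (the engine URL, table name, and DataFrame below are placeholders; the 2100 presumably matches SQL Server's limit of 2100 parameters per statement):

    import pandas as pd
    from sqlalchemy import create_engine

    # Placeholders -- use your own connection string, table name and DataFrame.
    engine = create_engine("mssql+pyodbc://user:password@host/mydb?driver=ODBC+Driver+17+for+SQL+Server")
    table = "my_table"
    df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})  # stand-in for the concatenated data

    # With method='multi', every row in a chunk contributes len(df.columns) bind
    # parameters to one INSERT, so the chunk size must shrink as columns increase.
    chunk_size = 2100 // len(df.columns) - 1

    df.to_sql(
        table,
        con=engine,
        if_exists='append',
        index=False,
        chunksize=chunk_size,
        method='multi',
    )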

NYC Coder
  • What is the purpose of that piece of code? Could you please explain? – Catarina Ribeiro Apr 21 '22 at 08:41
  • This was a tool we built to read data from any database and write to any database. The size of the table and the number of columns and rows were variable. A table with more columns needs a smaller chunk size; that's why we divide by the number of columns. After many tests, we found this formula to be optimal: 2100 // len(df.columns) – NYC Coder Apr 21 '22 at 11:18
  • The multi method is a must. If you are using Microsoft SQL Server, there's another option called `fast_executemany=True` (a minimal sketch follows these comments) – NYC Coder Apr 21 '22 at 11:19
  • It's still not as fast as using `bcp` on Linux, which inserts millions of records in seconds, and it never will be. Don't expect those kinds of insert speeds with to_sql – NYC Coder Apr 21 '22 at 11:25
  • Thank you for the explanation! I'll try this for the chunksize... I've also discovered that my to_sql was a little too slow because there might be a problem when reading some of the Excel files that I get the information from. I get the warning **Workbook contains no default style, apply openpyxl's default** on some of the files and it stays stuck there for a long time – Catarina Ribeiro Apr 21 '22 at 14:19
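
For the `fast_executemany=True` option mentioned in the comments, it is passed to `create_engine` when using the `mssql+pyodbc` dialect (the connection string below is a placeholder):

    from sqlalchemy import create_engine

    # fast_executemany is a pyodbc feature exposed through SQLAlchemy's
    # mssql+pyodbc dialect; it batches the parameterized INSERTs on the driver side.
    engine = create_engine(
        "mssql+pyodbc://user:password@host/mydb?driver=ODBC+Driver+17+for+SQL+Server",
        fast_executemany=True,
    )

    # df.to_sql(table, con=engine, if_exists='append', index=False, chunksize=chunk_size)

Because the driver already batches the inserts in this case, it is typically combined with to_sql's default method=None rather than method='multi'.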