
I have a table with around 400k rows and I am trying to bulk update 800 of them in two batches of 400 records each (SQL Server database). The table has no triggers or indexes except the default one on the primary key.

  • When I use bulk_update_mappings from SQLAlchemy, it takes around 40 seconds to complete the two updates.

  • When I write my own raw query, which is just a bunch of UPDATEs one after another like this:

from sqlalchemy import text

def bulk_update(data, db_session):
    sql = ""
    for d in data:
        sql += f"""UPDATE table SET <some field updates and parsing to correct db format> WHERE id = {d['id']} \n"""
    db_session.exec(text(sql))
    db_session.commit()

It takes about 0.2 seconds for the whole thing to complete.

I tried a bunch of other approaches, like using update(model).where(...).values(...), but that was almost as slow as the bulk one.
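
For context, the Core-style approach I mean is roughly the following sketch (MyModel and the value column are placeholders, not my real schema):

from sqlalchemy import update

def core_update(data, db_session):
    # one UPDATE statement per row, executed through the ORM session;
    # MyModel and "value" are placeholder names for illustration only
    for d in data:
        stmt = (
            update(MyModel)
            .where(MyModel.id == d["id"])
            .values(value=d["value"])
        )
        db_session.execute(stmt)
    db_session.commit()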

I tried to profile this, and it seems like bulk_update_mappings calls 'executemany', which takes most of the time; I didn't see that when I run the raw query. Does anyone have any idea why there is such a big difference between the two methods?

Is it fine to use the raw query, or would that introduce an architectural flaw, since I now have to manually keep track of the fields and parse them into the correct format in the UPDATE statement instead of using the SQLModel model of the table directly?

  • Have you tried using `create_engine(connection_url, fast_executemany=True)` when using the bulk update mappings approach? – Gord Thompson Jun 13 '23 at 17:44
  • @GordThompson This actually does the job. It's still just a fraction slower than the brute-force updates, which is interesting, but it speeds things up by a bearable amount, thanks! I wonder why it doesn't auto-detect whether you are on a compatible driver and just turn it on by default in that case. – Martin Todorov Jun 13 '23 at 18:19
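
For reference, the flag from the comments is passed to create_engine; a minimal sketch of what the faster setup could look like (the connection URL and MyModel are placeholders, and fast_executemany is specific to the pyodbc driver):

from sqlalchemy import create_engine
from sqlmodel import Session

# fast_executemany is a pyodbc-specific option; the URL below is a placeholder
engine = create_engine(
    "mssql+pyodbc://user:password@my_dsn",
    fast_executemany=True,
)

with Session(engine) as db_session:
    # data is a list of dicts, each containing "id" plus the fields to change
    db_session.bulk_update_mappings(MyModel, data)
    db_session.commit()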

0 Answers