
Happy new year everyone!

I'm currently struggling with ETL performance issues as I'm trying to write larger Pandas DataFrames (1-2 million rows, 150 columns) into an Oracle database. Even for just 1000 rows, Pandas' default to_sql() method runs well over 2 minutes (see code snippet below).

My strong hypothesis is that these performance issues are related to the underlying data types (mostly strings): I ran the same job on 1000 rows of random strings (benchmark: 3 min) and on 1000 rows of large random floats (benchmark: 15 seconds).

import pandas as pd
import sqlalchemy

def _save(self, data: pd.DataFrame):
    engine = sqlalchemy.create_engine(self._load_args['con'])
    table_name = self._load_args["table_name"]

    if self._load_args.get("schema", None) is not None:
        table_name = self._load_args['schema'] + "." + table_name

    with engine.connect() as conn:
        data.to_sql(
            name=table_name,
            con=conn,
            if_exists='replace',
            index=False,
            method=None  # Oracle dialect does not support multi-row inserts
        )
    return

Does anyone here have experience with efficiently loading mixed data into an Oracle database using Python?

Any hints, code snippets and/or API recommendations are very much appreciated.

Cheers,

  • For very large DataFrames you will likely get the best performance if you dump the DataFrame to a CSV file (or similar) and then shell out to run [SQL*Loader](https://docs.oracle.com/cd/B19306_01/server.102/b14215/ldr_concepts.htm). – Gord Thompson Jan 05 '21 at 23:19

1 Answer


As said in your question, you are not able to use method='multi' with your DB flavor. This is the key reason inserts are so slow: the data is going in row by row.

Using SQL*Loader as suggested by @GordThompson may be the fastest route for a relatively wide/big table (see: Example on setting up SQL*Loader).
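A minimal sketch of that approach, assuming a hypothetical target table my_table with columns col_a, col_b, col_c, a placeholder connect string user/password@dbhost/service, and sqlldr available on the PATH, might look like this:

import subprocess

import pandas as pd

def bulk_load_via_sqlldr(data: pd.DataFrame):
    # Dump the DataFrame to a flat CSV file that SQL*Loader can read.
    data.to_csv("data.csv", index=False, header=False)

    # Minimal SQL*Loader control file; the column list must match the CSV order.
    ctl = """\
LOAD DATA
INFILE 'data.csv'
APPEND INTO TABLE my_table
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
(col_a, col_b, col_c)
"""
    with open("load.ctl", "w") as f:
        f.write(ctl)

    # Shell out to sqlldr; direct=true uses direct-path loading instead of
    # conventional INSERT statements, which is where the speedup comes from.
    subprocess.run(
        ["sqlldr", "userid=user/password@dbhost/service",
         "control=load.ctl", "log=load.log", "direct=true"],
        check=True,
    )

In practice you would still need to handle NULLs, date formats and quoting in the control file, but direct-path loading typically beats row-by-row inserts by a wide margin.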

Another option to consider is using cx_Oracle directly; see *Speed up to_sql() when writing Pandas DataFrame to Oracle database using SqlAlchemy and cx_Oracle*.
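If you want to stay entirely in Python, a minimal sketch of batched inserts with cx_Oracle's executemany() could look like the following (the table name my_table and the connect string are placeholders; executemany() sends all rows as one batch instead of issuing one INSERT round trip per row):

import cx_Oracle
import pandas as pd

def insert_with_cx_oracle(data: pd.DataFrame):
    conn = cx_Oracle.connect("user", "password", "dbhost/service")
    try:
        cursor = conn.cursor()

        # Build an INSERT with one positional bind variable per column.
        cols = ", ".join(data.columns)
        binds = ", ".join(f":{i + 1}" for i in range(len(data.columns)))
        sql = f"INSERT INTO my_table ({cols}) VALUES ({binds})"

        # Convert the DataFrame to plain tuples and insert them in one batch.
        rows = list(data.itertuples(index=False, name=None))
        cursor.executemany(sql, rows)
        conn.commit()
    finally:
        conn.close()

Depending on your data you may also need to convert NaN values to None before binding, and use cursor.setinputsizes() so that string columns are bound as VARCHAR rather than LOBs, since LOB binding is often the real bottleneck with mostly-string DataFrames.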

– Ilya Berdichevsky