Happy new year everyone!
I'm currently struggling with ETL performance issues while trying to write larger Pandas DataFrames (1-2 million rows, 150 columns) into an Oracle database. Even for just 1000 rows, Pandas' default to_sql() method takes well over 2 minutes (see the _save method below).
My strong hypothesis is that these performance issues are somehow related to the underlying data types (mostly strings): I ran the same job on 1000 rows of random strings (benchmark: 3 min) and on 1000 rows of large random floats (benchmark: 15 seconds).
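For reference, this is roughly how such benchmark frames can be generated (a minimal sketch; the generator names, string length and seed are illustrative, the shapes match the 1000 x 150 case described above):

import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
letters = np.array(list("abcdefghijklmnopqrstuvwxyz"))

def random_string_frame(rows: int = 1000, cols: int = 150, length: int = 20) -> pd.DataFrame:
    # each cell is a random lowercase string of fixed length
    return pd.DataFrame({
        f"col_{i}": ["".join(rng.choice(letters, size=length)) for _ in range(rows)]
        for i in range(cols)
    })

def random_float_frame(rows: int = 1000, cols: int = 150) -> pd.DataFrame:
    # large random floats, same shape as the string frame
    return pd.DataFrame(rng.random((rows, cols)) * 1e9,
                        columns=[f"col_{i}" for i in range(cols)])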
import pandas as pd
import sqlalchemy

def _save(self, data: pd.DataFrame):
    engine = sqlalchemy.create_engine(self._load_args['con'])
    table_name = self._load_args["table_name"]
    if self._load_args.get("schema", None) is not None:
        table_name = self._load_args['schema'] + "." + table_name
    with engine.connect() as conn:
        data.to_sql(
            name=table_name,
            con=conn,
            if_exists='replace',
            index=False,
            method=None,  # Oracle dialect does not support multi-row inserts (method='multi')
        )
    return
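One possible direction (untested on my side) would be to pass an explicit dtype mapping to to_sql so that string columns are created as VARCHAR2 instead of the CLOB columns the defaults seem to produce. A rough sketch, reusing data, table_name and conn from the method above; the per-column VARCHAR lengths are assumptions:

from sqlalchemy.types import VARCHAR

# map every object (string) column to a VARCHAR of its maximum observed length
string_cols = data.select_dtypes(include="object").columns
dtype_map = {col: VARCHAR(int(data[col].str.len().max())) for col in string_cols}

data.to_sql(
    name=table_name,
    con=conn,
    if_exists="replace",
    index=False,
    dtype=dtype_map,    # explicit SQL types instead of the CLOB default
    chunksize=10_000,   # write in batches rather than one huge insert
)

Would that be a sensible direction?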
Does anyone here have experience with efficiently loading mixed data into an Oracle database using Python?
Any hints, code snippets and/or API recommendations are very much appreciated.
Cheers,