I'm using pandas read_sql to read a Netezza table through JDBC/jaydebeapi.
import datetime
import pandas as pd

start_time = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]
print("DB Start Date Time is " + start_time)
# The entire result set is fetched into a DataFrame here
int_df = pd.read_sql(query, conn)
end_time = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]
print("DB End Date Time is " + end_time)
The query is a simple SELECT * FROM database.table.
I'm reading a table with 700K rows and writing it out as a CSV (about 600 MB on creation). The read_sql call takes a godawful amount of time to read the table:

DB Start Date Time is 2020-08-03 10:26:11.317
DB End Date Time is 2020-08-03 11:15:19.841
As you can see, that's almost an hour. Why does it take so long to read?
I can read this table in a few seconds using nzsql, and in just 5 seconds using a Spark DataFrame (including the action of writing to CSV). What could be wrong? There are no transformations, just a plain extract.
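For reference, the Spark version is roughly this shape (the options shown are illustrative, not my exact job, and spark is an existing SparkSession):

# Rough shape of the fast Spark read/write for comparison
df = (spark.read.format('jdbc')
      .option('url', 'jdbc:netezza://nz-host:5480/database')
      .option('driver', 'org.netezza.Driver')
      .option('dbtable', 'database.table')
      .option('user', 'user')
      .option('password', 'password')
      .load())
df.write.csv('/tmp/extract', header=True)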
My server has no memory issues: it has over 200 GB of free memory and hardly any CPU usage while the process runs. What is the bottleneck? Are there better methods to do this?
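One variant I'm considering, in case the problem is fetching everything into a single DataFrame, is reading in chunks and streaming straight to CSV; a minimal sketch (the chunk size is arbitrary):

# Stream the table to CSV in chunks instead of one big DataFrame
first = True
for chunk in pd.read_sql(query, conn, chunksize=50000):
    chunk.to_csv('extract.csv', mode='w' if first else 'a',
                 header=first, index=False)
    first = False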