I'm using pandas read_sql to read a Netezza table through JDBC/jaydebeapi.

import datetime
import pandas as pd

# timestamp (to the millisecond) on either side of the read to measure elapsed time
start_time = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]
print(" DB Start Date Time is " + str(start_time))

int_df = pd.read_sql(query, conn)

end_time = datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S.%f')[:-3]
print(" DB End Date Time is " + str(end_time))

The query is a simple SELECT * FROM database.table.
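
For context, the connection is opened roughly like this (a minimal sketch; the host, port, credentials, and driver jar path below are placeholders, not my actual values):

import jaydebeapi

# Netezza's JDBC driver class; the nzjdbc.jar file must be available locally
conn = jaydebeapi.connect(
    'org.netezza.Driver',
    'jdbc:netezza://myhost:5480/MYDB',
    ['my_user', 'my_password'],
    '/path/to/nzjdbc.jar')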

I'm reading a table with 700K rows and writing it out as a CSV (about 600 MB on creation). The read_sql call is taking a godawful amount of time:

DB Start Date Time is 2020-08-03 10:26:11.317
DB End Date Time is 2020-08-03 11:15:19.841

As you can see, it's almost an hour. Why does it take so long to read?

I can read this table in a few seconds using nzsql, and in just 5 seconds using a Spark DataFrame (including the action of writing to CSV). What could be the problem? I have no transformations, just a plain extract.
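
For comparison, the Spark read is essentially the following (a sketch; the URL, credentials, and output path are placeholders, and it assumes the Netezza JDBC jar is already on the Spark classpath):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('netezza-extract').getOrCreate()

# read the whole table over JDBC and dump it to CSV, no transformations
df = (spark.read.format('jdbc')
      .option('url', 'jdbc:netezza://myhost:5480/MYDB')
      .option('driver', 'org.netezza.Driver')
      .option('dbtable', 'database.table')
      .option('user', 'my_user')
      .option('password', 'my_password')
      .load())

df.write.csv('/tmp/table_extract', header=True)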

My server has no memory issues; it has over 200 GB of free memory and hardly any CPU usage while the process runs. What is the bottleneck? Are there any better methods to do this?

1 Answer

Ran into a similar issue: a query against a 20 billion row table was taking 20 minutes to run through read_sql when it takes only 25 seconds on average through SSMS. I'm not sure why (that's how I found your question, since I've been trying to figure that part out), but adding chunksize as a parameter worked wonders. I set it to 10**5 and queries dropped back down to a 22 second average.

chunks = pd.read_sql(query, conn, chunksize=10**5)  # with chunksize, read_sql returns an iterator of DataFrames
int_df = pd.concat(chunks, ignore_index=True)       # reassemble into a single DataFrame
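
If the end goal is a CSV anyway (as in your case), the chunks can also be streamed straight to disk instead of being held in memory all at once; a sketch, with the output filename as a placeholder:

for i, chunk in enumerate(pd.read_sql(query, conn, chunksize=10**5)):
    # the first chunk creates the file with a header; later chunks append without one
    chunk.to_csv('extract.csv', mode='w' if i == 0 else 'a',
                 header=(i == 0), index=False)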

Hope this might be of some help.

– ww2406
  • Thanks for your response. I used a chunksize of 10**5 but was not successful in decreasing the processing time. My table is nowhere near that big; it has 4 million rows, but it takes forever to complete. I ran it for a whole 24 hours to no avail. – matterwaves Aug 28 '20 at 04:00
  • Hi! Have you figured out how to fix this issue? @matterwaves – semenchukou Nov 26 '20 at 11:21
  • No, I had to use a different mechanism. Since we had Sqoop available in our environment, I had to forego pandas and use Sqoop. That worked perfectly. – matterwaves Dec 11 '20 at 16:38