
I am new to Python and stuck with the issue stated below; please pardon my ignorance.

Problem statement: A Python 3 DataFrame holds values (approximately 1 million rows and a few thousand columns) that need to be saved to a Redshift database.

Solutions looked into:

1. As per the documentation, the quickest method to perform such an operation is the COPY command. But in my case the table structure keeps on changing, so this solution is not the correct fit for me.

2. Tried to save the DataFrame directly using to_sql, but it takes forever to complete. I tried the same thing with chunksize, but that also didn't work (it also takes forever; I left the script running for 2-3 hours and it was still going):

df.to_sql('a_265', conn, index=False, if_exists='replace')
df.to_sql('a_265', conn, index=False, if_exists='replace', chunksize=10)

3. Looked into another available option, odo. But this also needs a table structure to be present, since it too uses the COPY command in the background.

So are there any solutions available that will enable me to achieve a bulk upload of a CSV into Redshift?

Note: The table structure keeps on changing. Hence, if you suggest the COPY command, please also tell me how to handle changes in the table structure.
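To make the note concrete, the sketch below is roughly what I imagine handling a changing structure would involve before each COPY. This is only my rough idea; the dtype-to-Redshift-type mapping is a guess on my part, not something I have working.

```python
# Rough idea (untested): derive a CREATE TABLE statement from the
# DataFrame itself, so a changing structure can be recreated before COPY.
# The dtype-to-Redshift-type mapping is a simplifying assumption.
import pandas as pd

DTYPE_MAP = {
    'int64': 'BIGINT',
    'float64': 'DOUBLE PRECISION',
    'bool': 'BOOLEAN',
    'datetime64[ns]': 'TIMESTAMP',
}

def create_table_ddl(df, table_name):
    # Fall back to VARCHAR for anything not in the map (e.g. object dtype)
    cols = ', '.join(
        '"{}" {}'.format(name, DTYPE_MAP.get(str(dtype), 'VARCHAR(256)'))
        for name, dtype in df.dtypes.items()
    )
    return 'CREATE TABLE IF NOT EXISTS {} ({})'.format(table_name, cols)
```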

1 Answer


Oh! to_sql on a million rows is scary, because when pandas submits the SQL command it is not doing one insert with 1M records; it inserts each record individually and waits for the ACK before sending the next.

Solution - Assuming the table structure changes for every new file, instead of making multiple calls to single-row INSERT statements, use a MULTI-ROW INSERT. Multi-row inserts improve performance by batching up a series of inserts. Depending on the size of your data, you can split your DataFrame and build your multi-row inserts accordingly, as sketched below.
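For illustration, here is a minimal sketch of that approach, assuming a psycopg2 connection to Redshift and that the target table already exists with matching columns (the function name, chunk size, and connection details are my own placeholders):

```python
import psycopg2

def multi_row_insert(conn, df, table_name, chunk_size=500):
    # Keep each statement well under Redshift's 16 MB SQL statement size
    # limit; with thousands of columns you may need a much smaller chunk.
    col_list = ', '.join('"{}"'.format(c) for c in df.columns)
    row_template = '(' + ', '.join(['%s'] * len(df.columns)) + ')'
    with conn.cursor() as cur:
        for start in range(0, len(df), chunk_size):
            chunk = df.iloc[start:start + chunk_size]
            # One VALUES group per row: (%s, ...), (%s, ...), ...
            sql = 'INSERT INTO {} ({}) VALUES {}'.format(
                table_name, col_list, ', '.join([row_template] * len(chunk)))
            params = []
            for row in chunk.itertuples(index=False):
                # .item() converts numpy scalars to native Python types,
                # which psycopg2 can adapt; other values pass through.
                params.extend(v.item() if hasattr(v, 'item') else v
                              for v in row)
            cur.execute(sql, params)
    conn.commit()

# Usage (connection details are placeholders):
# conn = psycopg2.connect(host='...', dbname='...', user='...', password='...')
# multi_row_insert(conn, df, 'a_265')
```

Note also that newer pandas versions (0.24+) expose the same idea directly: df.to_sql(..., method='multi', chunksize=500) sends one multi-row INSERT per chunk instead of one statement per row.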

If I’ve made a bad assumption please comment and I’ll refocus my answer.

Rahul Gupta
  • Thanks for the solution, but I am not sure a multi-row insert would be the correct approach for me, since I currently have a few thousand columns and these columns will keep increasing. Please share your input on whether I should go ahead with multi-row; all I need to do is create a table with the values of the DataFrame and upload it. – ankit khanduri Jul 25 '17 at 14:54
  • Multi-row insert performance is very similar to that of the COPY command. In my opinion, a multi-row insert is the best option for your use case. – Rahul Gupta Jul 26 '17 at 13:51