
I'm trying to import data from a .xlsx file into a SQL database. Right now, I have a Python script which uses the openpyxl and MySQLdb modules to

  • establish a connection to the database
  • open the workbook
  • grab the worksheet
  • loop through the rows of the worksheet, extracting the columns I need and inserting each record into the database, one by one

Unfortunately, this is painfully slow. I'm working with a huge data set, so I need to find a faster way to do this (preferably with Python). Any ideas?

import openpyxl
import MySQLdb

wb = openpyxl.load_workbook(filename="file", read_only=True)
ws = wb['My Worksheet']

conn = MySQLdb.connect()
cursor = conn.cursor()

cursor.execute("SET autocommit = 0")

# skip the header row, then insert each record one at a time
for row in ws.iter_rows(min_row=2):
    sql_row = ...  # the columns I need, extracted from `row`
    cursor.execute("INSERT ...", sql_row)  # one INSERT per row (this is the slow part)

conn.commit()
Brinley
    Can you include an example snippet of your current code? What version of SQL are you using? The solution is likely to turn off autocommit in SQL or alter the exact way you enter rows into the DB, depending on what you're doing in the status quo. – Alex Alifimoff Jun 26 '17 at 15:09

2 Answers


Disable autocommit if it is on! Autocommit is a setting that makes MySQL commit each statement to disk as soon as it runs. That is fine if you only have one insert, but it is exactly what makes each individual insert take so long. Instead, turn it off, run all of your insert statements, and commit only once at the end.

Something like this might work:

import MySQLdb

con = MySQLdb.connect(
    host="your db host",
    user="your username",
    passwd="your password",
    db="your db name"
)
cursor = con.cursor()
cursor.execute("SET autocommit = 0")  # run this through the cursor; the connection object has no execute()

data = ...  # some code to get data from excel
for datum in data:
    # pass parameters to execute() instead of str.format to avoid SQL injection
    cursor.execute("your insert statement with %s placeholders", datum)

con.commit()
con.close()
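
If disabling autocommit alone isn't enough, another common speedup is batching the inserts with cursor.executemany(), which sends all the rows in far fewer round trips than one execute() per row. A rough sketch, where my_table and its columns a, b, c are placeholders for your actual table, and ws is the openpyxl worksheet from the question:

rows = []
for row in ws.iter_rows(min_row=2):
    # collect the values for each record as a tuple
    rows.append((row[0].value, row[1].value, row[2].value))

# one executemany() call instead of one execute() per row
cursor.executemany(
    "INSERT INTO my_table (a, b, c) VALUES (%s, %s, %s)",
    rows
)
con.commit()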
Alex Alifimoff
  • Thank you the help. Inserting is a little faster but still taking several minutes to finish (my data set has 200000+ rows). Any other suggestions? – Brinley Jun 26 '17 at 15:38
  • Can you please post your code in your original post? This was my best guess, but there might be something more obvious if we see your code! – Alex Alifimoff Jun 26 '17 at 15:41
  • Okay. see above. – Brinley Jun 26 '17 at 15:55
  • @Brinley, can you try calling `con.commit()` immediately after the autocommit set and see if that speeds anything up? – Alex Alifimoff Jun 26 '17 at 16:02
  • Still quite slow. – Brinley Jun 26 '17 at 16:06
  • And presumably if you actually comment out the insert statement, it speeds up quite a bit? Just want to make sure that the problem is actually with the insert statement. – Alex Alifimoff Jun 26 '17 at 16:08
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/147637/discussion-between-alex-alifimoff-and-brinley). – Alex Alifimoff Jun 26 '17 at 16:13
  • It's still pretty slow actually. I think it might be ws.iter_rows() that's slowing it down. ws.iter_rows() is the function openpyxl gives me to iterate over the rows in a spreadsheet. Perhaps there's a faster way to do this? – Brinley Jun 26 '17 at 16:14
  • There is a [variety of discussion on a couple of SO posts](https://stackoverflow.com/questions/35823835/reading-excel-file-is-magnitudes-slower-using-openpyxl-compared-to-xlrd) about the speed of openpyxl. One option that might be even easier would be to use pandas. MaxU in the linked post has a pandas example. – Alex Alifimoff Jun 26 '17 at 16:18
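
For reference, a minimal sketch of the pandas route mentioned in the comment above, reusing conn and cursor from the question; the file name, sheet name, column names, and table name (file.xlsx, My Worksheet, col_a, col_b, my_table) are placeholders:

import pandas as pd

# pandas reads the whole sheet in one go, which is usually much faster
# than iterating cell by cell with openpyxl
df = pd.read_excel("file.xlsx", sheet_name="My Worksheet")

rows = list(df[["col_a", "col_b"]].itertuples(index=False, name=None))
cursor.executemany(
    "INSERT INTO my_table (col_a, col_b) VALUES (%s, %s)",
    rows
)
conn.commit()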

Consider saving the workbook's worksheet as a CSV, then using MySQL's LOAD DATA INFILE. This is often a very fast way to bulk-load data.

sql = """LOAD DATA INFILE '/path/to/data.csv' 
         INTO TABLE myTable  
         FIELDS TERMINATED BY ',' 
         OPTIONALLY ENCLOSED BY '\"'
         LINES TERMINATED BY '\n'"""  

cursor.execute(sql)
con.commit()
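
The CSV itself could be produced from the workbook with the standard csv module. A rough sketch, assuming the worksheet from the question and that the header row should be skipped; the output path matches the statement above:

import csv
import openpyxl

wb = openpyxl.load_workbook("file", read_only=True)
ws = wb['My Worksheet']

# stream the worksheet straight into the CSV that LOAD DATA INFILE will read
with open('/path/to/data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    for row in ws.iter_rows(min_row=2):
        writer.writerow([cell.value for cell in row])

Note that plain LOAD DATA INFILE reads the file on the database server; if the server is remote or secure_file_priv blocks it, LOAD DATA LOCAL INFILE reads from the client instead (the connection may need local_infile enabled).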
Parfait