
I keep stumbling across large CSV files (80,000+ lines) which I have to keep in my database. It takes around 20-30 minutes to push it all to my MySQL database. I have a simple for loop which just loops over the whole CSV:

import csv
import MySQLdb

# open the connection to the MySQL server
# using MySQLdb
mydb = MySQLdb.connect(host='hst', user='usr', passwd='pwd', db='db')
cursor = mydb.cursor()

with open('product_de.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';')
    # execute and insert the csv into the database
    for row in csv_reader:
        if "PVP_BIG" and "DATE_ADD" in row:
            print("First line removed")
        else:
            print("Not found!")
            sql = "INSERT INTO big (SKU,Category,Attribute1,Attribute2,Value1,Value2,Brand,Price,PVP_BIG,PVD,EAN13,WIDTH,HEIGHT,DEPTH,WEIGHT,Stock) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"
            val = (row[0], row[1], row[3], row[4], row[5], row[6], row[8], row[10], row[11], row[12], row[15], row[16], row[17], row[18], row[19], row[20])
            cursor.execute(sql, val)
            print(row)

# close the connection to the database
#mydb.commit()
cursor.close()
print("CSV has been imported into the database")

Is there any way I can divide the work, e.g. make it concurrent, so that it takes maybe 3-5 minutes, depending on the computer hardware?

  • May be helpful. Try [executemany](https://stackoverflow.com/questions/42039794/inserting-a-list-holding-multiple-values-in-mysql-using-pymysql). – Savostyanov Konstantin Nov 17 '20 at 16:40
  • Why don't you use `LOAD DATA INFILE ...`? – madzohan Nov 17 '20 at 16:40
  • also note that `if "PVP_BIG" and "DATE_ADD" in row:` will always eval to `True` – SuperStew Nov 17 '20 at 16:42
  • @SuperStew it won't. I tested it. If both of them are found in a row of the csv, then it is True, otherwise it is not. – Denni Nov 17 '20 at 16:44
  • @madzohan the problem here is that the MySQL connection is not local, so there is no real path, I guess – Denni Nov 17 '20 at 16:44
  • You only have two tasks. No amount of concurrency will push them from 20-30 min down to 3-5 min – a factor of 1.5-2 *might* be doable, definitely not a factor of 6. – MisterMiyagi Nov 17 '20 at 16:45
  • @SuperStew `if "PVP_BIG" and "DATE_ADD" in row:` is equal to `if "DATE_ADD" in row:`. It's wrong but sometimes it is correct. ;) – MisterMiyagi Nov 17 '20 at 16:46
  • OK, then you can use https://dev.mysql.com/doc/refman/8.0/en/mysqlimport.html, where you could specify `--host=host_name`, `--password[=password]`, etc. – madzohan Nov 17 '20 at 16:51
  • Anyway, you have an XY problem: https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem – madzohan Nov 17 '20 at 16:54
  • @Denni I misspoke, but that's still incorrect. As MisterMiyagi pointed out, what you're saying is `("PVP_BIG") and ("DATE_ADD" in row)`, and the first part is always true. – SuperStew Nov 17 '20 at 17:06
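
For reference, the `executemany` approach suggested in the comments might look roughly like the sketch below. It reuses the question's placeholder connection details and column mapping; `cursor.executemany` sends the whole batch of parameter tuples through a single call instead of one round trip per row:

import csv
import MySQLdb

# placeholder connection details, mirroring the question's setup
mydb = MySQLdb.connect(host='hst', user='usr', passwd='pwd', db='db')
cursor = mydb.cursor()

sql = "INSERT INTO big (SKU,Category,Attribute1,Attribute2,Value1,Value2,Brand,Price,PVP_BIG,PVD,EAN13,WIDTH,HEIGHT,DEPTH,WEIGHT,Stock) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"

with open('product_de.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';')
    next(csv_reader)  # skip the header row
    # collect all parameter tuples, using the same column mapping as the question
    rows = [(row[0], row[1], row[3], row[4], row[5], row[6], row[8], row[10], row[11], row[12], row[15], row[16], row[17], row[18], row[19], row[20]) for row in csv_reader]

cursor.executemany(sql, rows)  # one call for the whole batch
mydb.commit()
cursor.close()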

1 Answer


First, you may get a big speedup by removing the `print(row)` call from your inner loop. Everything else in the program waits on this action, and printing is an I/O operation that can take much longer than you might think. Second, you might find a significant speedup by batching your INSERT statements, i.e. inserting more than one row at a time, say 100 or so. Third, the best way to do this probably involves asyncio, but I don't have much experience with it. You're likely I/O bound talking to the DB and reading the CSV file, and you never do both at once, so I'd go with a simple two-thread solution like the one below:

import csv
import MySQLdb
import threading
from queue import Queue


def row_insert_thread(q: Queue, cursor, mydb):
    # pull (sql, values) commands off the queue and execute them until
    # a None sentinel arrives, then commit and close the cursor
    while True:
        command = q.get()
        if command is None:
            mydb.commit()
            cursor.close()
            break
        cursor.execute(*command)

mydb = MySQLdb.connect(host='hst', user='usr', passwd='pwd', db='db')
cursor = mydb.cursor()

insert_q = Queue()

row_thread = threading.Thread(target=row_insert_thread, args=(insert_q, cursor, mydb))
row_thread.start()

sql = "INSERT INTO big (SKU,Category,Attribute1,Attribute2,Value1,Value2,Brand,Price,PVP_BIG,PVD,EAN13,WIDTH,HEIGHT,DEPTH,WEIGHT,Stock) VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"

with open('product_de.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';')
    next(csv_reader)  # skip the header row; I'm assuming there is only one
    # enqueue one insert command per CSV row; the worker thread executes them
    for row in csv_reader:
        val = (row[0], row[1], row[3], row[4], row[5], row[6], row[8], row[10], row[11], row[12], row[15], row[16], row[17], row[18], row[19], row[20])
        insert_q.put((sql, val))

# signal the worker to finish, then wait for it
insert_q.put(None)
row_thread.join()

print("CSV has been imported into the database")


For the multi-row insert statement, I'm not used to MySQL (I'm going from SQLite experience here), but I think this will work:

def insert_multiple_rows(cursor, rows: list):
    # one (%s, ...) group per row, joined into a single INSERT statement
    placeholders = ", ".join(["(%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)"] * len(rows))
    sql = f"INSERT INTO big (SKU,Category,Attribute1,Attribute2,Value1,Value2,Brand,Price,PVP_BIG,PVD,EAN13,WIDTH,HEIGHT,DEPTH,WEIGHT,Stock) VALUES {placeholders}"
    # flatten the rows into one flat argument list
    args = [col for row in rows for col in row]
    cursor.execute(sql, args)

I expect you can integrate this into your code if you want to use it: change the thread to take a list of rows, then in the main loop append values to a list until it reaches whatever batch size you want (or you run out of rows), and put that list into the insert_q, as sketched below.
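
A rough sketch of that integration, assuming the connection, cursor, and imports from the earlier snippet and a hypothetical batch size of 100:

BATCH_SIZE = 100  # hypothetical batch size; tune as needed

def batch_insert_thread(q: Queue, cursor, mydb):
    # same idea as row_insert_thread, but each queue item is a list of rows
    while True:
        batch = q.get()
        if batch is None:
            mydb.commit()
            cursor.close()
            break
        insert_multiple_rows(cursor, batch)

insert_q = Queue()
batch_thread = threading.Thread(target=batch_insert_thread, args=(insert_q, cursor, mydb))
batch_thread.start()

with open('product_de.csv') as csv_file:
    csv_reader = csv.reader(csv_file, delimiter=';')
    next(csv_reader)  # skip the header row
    batch = []
    for row in csv_reader:
        batch.append((row[0], row[1], row[3], row[4], row[5], row[6], row[8], row[10], row[11], row[12], row[15], row[16], row[17], row[18], row[19], row[20]))
        if len(batch) == BATCH_SIZE:
            insert_q.put(batch)
            batch = []
    if batch:  # flush the final partial batch
        insert_q.put(batch)

insert_q.put(None)
batch_thread.join()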

David Oldford