I have a dataset in a CSV file consisting of 2500 lines. The file is structured in the following (simplified) way:
id_run; run_name; receptor1; receptor2; receptor3_value; [...]; receptor50_value
Each receptor in the file already exists in a table and has a unique id.
I need to upload each line to a table with this format:
id_run; id_receptor; receptor_value
1; 1; 2.5
1; 2; 3.2
1; 3; 2.1
[...]
2500; 1; 2.4
2500; 2; 3.0
2500; 3; 1.1
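For clarity, here is a minimal sketch of the reshaping step I'm describing. The file name, column positions and variable names are illustrative only, and the receptor ids here are just the column order (in my real code they come from the database):

import csv

# Sketch only: turn each wide CSV line into 50 (id_run, id_receptor, value) rows.
triples = []
with open('runs.csv') as f:  # placeholder file name
    reader = csv.reader(f, delimiter=';')
    next(reader)  # skip the header line
    for row in reader:
        id_run = row[0]
        for id_receptor, value in enumerate(row[2:], start=1):  # skip id_run, run_name
            triples.append((id_run, id_receptor, value))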
Currently, I'm writing all the data I need to upload to a .txt file and using PostgreSQL's COPY command to transfer the file to the destination table.
For 2500 runs (so 2500 lines in the CSV file) and 50 receptors, my Python program generates ~110000 records in the text file to be uploaded.
I'm dropping the foreign keys of the destination table and restoring them after the upload.
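Concretely, the drop/restore amounts to ALTER TABLE statements issued through psycopg2. The constraint, column and referenced table names below are placeholders, not my real ones:

cur = conn.cursor()
# drop the FK before the bulk load (placeholder constraint name)
cur.execute("ALTER TABLE receptor_results "
            "DROP CONSTRAINT receptor_results_rec_uid_fkey")
# ... COPY the data here ...
# restore the FK afterwards
cur.execute("ALTER TABLE receptor_results "
            "ADD CONSTRAINT receptor_results_rec_uid_fkey "
            "FOREIGN KEY (rec_uid) REFERENCES receptors (id)")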
With this method, it currently takes ~8 seconds to generate the text file and ~1 second to copy it into the table.
Is there a way, method, library or anything else I could use to accelerate the preparation of the data for upload, so that ~90% of the total time isn't spent writing the text file?
Edit:
Here is my (updated) code. I'm now doing a bulk write to the text file. It looks like it's faster (110,000 lines uploaded in 3.8 seconds).
# Bulk write to file
lines = []
for line_i, line in enumerate(run_specs):
    # the run_specs variable holds the attributes defining a run
    # (id_run, run_name, etc.), i.e. a line of the CSV file without the
    # receptors data
    sc_uid = get_uid(db, table_name)  # function to get the unique ID of the run
    for rec_i, rec in enumerate(rec_uids):
        # rec_uids holds the unique IDs in the database for the
        # receptors in the CSV file
        line_to_write = '%s %s %s\n' % (sc_uid, rec, rec_values[line_i][rec_i])
        lines.append(line_to_write)

# write to file
fn = r"data\tmp_data_bulk.txt"
with open(fn, 'w') as tmp_data:
    tmp_data.writelines(lines)

# get foreign keys of receptor_results
rr_fks = DB.get_fks(conn, 'receptor_results')  # function to get foreign keys

# drop the foreign keys
for key in rr_fks:
    DB.drop_fk(conn, 'receptor_results', key[0])  # function to drop FKs

# upload data with custom function using the COPY SQL command
DB.copy_from(conn, fn, 'receptor_results', ['sc_uid', 'rec_uid', 'value'],
             " ", False)

# restore foreign keys
for key in rr_fks:
    DB.create_fk(conn, 'receptor_results', key[0], key[1], key[2])

# commit to database
conn.commit()
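For reference, DB.get_fks, DB.drop_fk, DB.create_fk and DB.copy_from are my own helpers; DB.copy_from essentially wraps psycopg2's cursor.copy_from, roughly like this (simplified, not the exact implementation):

def copy_from(conn, filename, table, columns, sep, header):
    # simplified sketch of my wrapper
    cur = conn.cursor()
    with open(filename, 'r') as f:
        if header:
            next(f)  # skip a header line if the file has one
        cur.copy_from(f, table, sep=sep, columns=columns)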
Edit #2:
Using the cStringIO library, I replaced the creation of a temporary text file with a file-like object, but the speed gain is very small.
Code changed:
import cStringIO

outf = cStringIO.StringIO()
for rec_i, rec in enumerate(rec_uids):
    outf.write('%s %s %s\n' % (sc_uid, rec, rec_values[line_i][rec_i]))
outf.seek(0)  # rewind so copy_from reads from the start of the buffer
cur.copy_from(outf, 'receptor_results', sep=' ',
              columns=('sc_uid', 'rec_uid', 'value'))
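The timings quoted above are simple wall-clock measurements around the preparation and upload steps, something like:

import time

t0 = time.time()
# ... build the lines / fill the cStringIO buffer ...
t1 = time.time()
# ... COPY into the table and commit ...
t2 = time.time()
print 'preparation: %.1f s, upload: %.1f s' % (t1 - t0, t2 - t1)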