
I have a dataset in a CSV file consisting of 2500 lines. The file is structured in the following (simplified) way:

id_run; run_name; receptor1; receptor2; receptor3_value; [...]; receptor50_value

Each receptor in the file is already in a table and has a unique id.

I need to upload each line to a table with this format:

id_run; id_receptor; receptor_value
1; 1; 2.5
1; 2; 3.2
1; 3; 2.1
[...]
2500; 1; 2.4
2500; 2; 3.0
2500; 3; 1.1

Currently, I'm writing all the data I need to upload to a .txt file and using PostgreSQL's COPY command to transfer the file to the destination table.
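The COPY step itself goes through psycopg2; stripped down, it looks roughly like this (a sketch; connection setup, commit and error handling omitted, column names as in the destination table shown further down):

cur = conn.cursor()
with open(r"data\tmp_data_bulk.txt") as f:
    cur.copy_from(f, 'receptor_results', sep=' ',
                  columns=('sc_uid', 'rec_uid', 'value'))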

For 2500 runs (so 2500 lines in the CSV file) and 50 receptors, my Python program generates ~110000 records in the text file to be uploaded.

I'm dropping the foreign keys of the destination table and restoring them after the upload.
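Dropping and restoring the keys boils down to plain ALTER TABLE statements, roughly along these lines (a sketch with a hypothetical constraint name and referenced table):

cur = conn.cursor()
# drop the constraint before the bulk load (hypothetical constraint name)
cur.execute("ALTER TABLE receptor_results DROP CONSTRAINT receptor_results_rec_uid_fkey")
# ... COPY the data ...
# recreate it afterwards (hypothetical referenced table)
cur.execute("ALTER TABLE receptor_results "
            "ADD CONSTRAINT receptor_results_rec_uid_fkey "
            "FOREIGN KEY (rec_uid) REFERENCES receptors (rec_uid)")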

Using this method, it currently takes ~8 seconds to generate the text file and 1 second to copy the file to the table.

Is there a way, method, library or anything else I could use to speed up the preparation of the data for the upload, so that 90% of the total time isn't spent writing the text file?

Edit:

Here is my (updated) code. I'm now writing to the text file in bulk. It looks like it is faster (110,000 lines uploaded in 3.8 seconds).

# Bulk write to file
lines = []
for line_i, line in enumerate(run_specs):
    # the run_specs variable consists of the attributes defining a run 
    # (id_run, run_name, etc.). So basically a line in the CSV file without the 
    # receptors data
    sc_uid = get_uid(db, table_name) # function to get the unique ID of the run
    for rec_i, rec in enumerate(rec_uids):
        # the rec_uids variable is the unique IDs in the database for the 
        # receptors in the CSV file
        line_to_write = '%s %s %s\n' % (sc_uid, rec, rec_values[line_i][rec_i])
        lines.append(line_to_write)

# write to file
fn = r"data\tmp_data_bulk.txt"
with open(fn, 'w') as tmp_data:
    tmp_data.writelines(lines)

# get foreign keys of receptor_results
rr_fks = DB.get_fks(conn, 'receptor_results') # function to get foreign keys

# drop the foreign keys
for key in rr_fks:
    DB.drop_fk(conn, 'receptor_results', key[0]) # function to drop FKs

# upload data with custom function using the COPY SQL command
DB.copy_from(conn, fn, 'receptor_results', ['sc_uid', 'rec_uid', 'value'],\
                                                                    " ", False)

# restore foreign keys
for key in rr_fks:
    DB.create_fk(conn, 'receptor_results', key[0], key[1], key[2])

# commit to database
conn.commit()

Edit #2:

Using the cStringIO library, I replaced the creation of a temporary text file with a file-like object, but the speed gain is very small.

Code changed:

outf = cStringIO.StringIO()
for line_i, line in enumerate(run_specs):
    sc_uid = get_uid(db, table_name)
    for rec_i, rec in enumerate(rec_uids):
        outf.write('%s %s %s\n' % (sc_uid, rec, rec_values[line_i][rec_i]))

outf.seek(0)  # rewind so copy_from reads from the beginning
cur.copy_from(outf, 'receptor_results', sep=' ',
              columns=('sc_uid', 'rec_uid', 'value'))
kaycee
    What code do you currently use? [mcve] – Håken Lid Jun 03 '16 at 16:41
  • I edited the main post with my code! – kaycee Jun 03 '16 at 16:59
  • If your question is about how to prepare your text data fast, then it has nothing to do with the database server. This is confusing. – vitaly-t Jun 03 '16 at 17:03
  • I'm looking for ways to make my process faster, either using other commands to prepare my text file or uploading differently to postgres – kaycee Jun 03 '16 at 17:05
  • @kaycee I have answered that below ;) – vitaly-t Jun 03 '16 at 17:11
  • copy_from takes any [file-like object](http://initd.org/psycopg/docs/usage.html#using-copy-to-and-copy-from), you could take advantage of that and never write the data to a file at all. If I get time later I'll post an answer along those lines. I think COPY should outperform inserts (even multi-line inserts) with a few thousand records. – langton Jun 04 '16 at 10:03
  • Can you time the individual calls so we know how long is spent managing the keys vs the copy vs filling the stringio instance? – langton Jun 06 '16 at 17:37

2 Answers


Yes, there is something you can do to speed up writing the data to the file in advance: don't bother!

You already have the data in memory, so fitting it there isn't an issue. Instead of writing the lines to a list of strings, write them to a slightly different object: a StringIO instance. The data then stays in memory and can be passed directly as the parameter to psycopg2's copy_from function.

import StringIO  # cStringIO.StringIO works as well

filelike = StringIO.StringIO('\n'.join(['1\tA', '2\tB', '3\tC']))
cursor.copy_from(filelike, 'your_table_name')

Notice that the StringIO must contain the newlines, the field separators and so on, just as the file would.
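If you build the buffer incrementally instead of from one pre-joined string, remember to rewind it before handing it to copy_from, since copy_from reads from the current position (a sketch, assuming a rows iterable of (sc_uid, rec_uid, value) tuples):

buf = StringIO.StringIO()
for row in rows:                      # rows: iterable of (sc_uid, rec_uid, value)
    buf.write('%s\t%s\t%s\n' % row)   # tab is copy_from's default separator
buf.seek(0)                           # rewind before the COPY
cursor.copy_from(buf, 'receptor_results',
                 columns=('sc_uid', 'rec_uid', 'value'))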

langton
  • I used the cStringIO library, but the speed gain is very minimal. See Edit #2 in the main post. – kaycee Jun 06 '16 at 15:55

I'm writing all the data I need to upload in a .txt file and I'm using the COPY command from postgreSQL to transfer the file to the destination table.

It is a heavy and unnecessary round-trip for all your data. Since you already have it in memory, you should just translate it into a multi-row insert directly:

INSERT INTO table(col1, col2) VALUES (val1, val2), (val3, val4), ...

i.e. concatenate your data into such a query and execute it as is.

In your case you would probably generate and execute 50 such inserts, with 2500 rows in each, according to your requirements.
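For example, with psycopg2 you could assemble the query like this (a sketch; rows is assumed to be the list of (sc_uid, rec_uid, value) tuples you already hold in memory, and mogrify takes care of the escaping):

chunk = 2500
for i in range(0, len(rows), chunk):
    values = ','.join(cur.mogrify('(%s,%s,%s)', r) for r in rows[i:i + chunk])
    cur.execute('INSERT INTO receptor_results (sc_uid, rec_uid, value) '
                'VALUES ' + values)
conn.commit()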

It will be the best-performing solution ;)

vitaly-t