Matching and merging two text tables?

Question

I have 2 (pretty big, ~15k lines) csv tables in the following formats:

Disease/Trait                Mapped_gene    p-Value 
Wegener's granulomatosis    HLA-DPB1        2.00E-50    
Wegener's granulomatosis    TENM3 - DCTD    2.00E-06    
Brugada syndrome            SCN5A           1.00E-14    
Brugada syndrome            SCN10A          1.00E-68    
Brugada syndrome            HEY2 - NCOA7    5.00E-17    
Major depressive disorder   IRF8 - FENDRR   3.00E-07    


Identifier  Homologues  Symbol
CG11621     5286    HEY2
CG11621     5287    IRF8
CG11621     5287    PIK3C2B
CG11621     5288    PIK3C2G
CG11621     5288    PIK3C2G
CG11949     2035    DCTD
CG11949     2035    EPB41
CG11949     2036    EPB41L1
CG11949     2037    EPB41L2

I'd like to use Python to compare the tables such that if any of the "Symbol" column from table 2 match "Mapped_gene" from table 1, the matching rows from each table can be merged together and put into an output file.

I've tried using the Pandas plugin, but couldn't make it work. Has anyone got any better ideas?

Thanks.

How big are the csv files? just thinking about can this be done in memory or do i need a database? — Crazyshezy, Feb 11 '14 at 12:14
Can there be more than one disease per mapped gene? (ie. two or more rows in the first table with the same value in column `Mapped_gene`) — Ricardo Cárdenes, Feb 11 '14 at 12:57

score 0 · Accepted Answer · answered Feb 11 '14 at 13:09

This should work as you want:

import csv

diseases = {}

# Load the disease file in memory
with csv.reader(open('table1.csv', 'rb')) as dfile:
    # Skip the header
    dfile.next()
    for disease, gene, pvalue in dfile:
        diseases[gene] = (disease, pvalue)

with csv.reader(open('table2.csv', 'rb')) as idfile, csv.writer(open('output.csv', 'wb')) as output:
    # Skip the header
    idfile.next()
    for ident, homologue, symbol in idfile:
        if symbol in diseases:
            output.writerow((ident, homologue, symbol) + diseases[symbol])

It assumes that every gene name under Mapped_gene is unique, though. It can be easily extended to cope with duplicates, otherwise.

Note that it will load up the whole disease file in memory first. 15000 lines shouldn't be too much (under 1MB, probably), but have it in mind if you go for something (much) bigger — Ricardo Cárdenes, Feb 12 '14 at 17:05

Matching and merging two text tables?

1 Answers1