How do you loop over and match columns 1 and 2 of file1 against column 1 of file2 in order to replace them with the adjacent cell in column 2 of file2?

Question

I am using String-db interactions for a project, but the complete list of interactions uses Ensembl protein IDs. I would like to replace those Ensembl protein IDs with their HGNC-approved gene symbols. Using BioMart, I downloaded a table of Ensembl protein IDs and their respective gene symbols. So I need to match every Ensembl ID in the String file (both the first and second columns contain Ensembl IDs) against the Ensembl IDs in my BioMart file, and then replace each ID with the corresponding gene symbol from the BioMart file. An extra complication: each Ensembl ID in the String file is prefixed with "9606.", which is absent from the BioMart file. This number denotes a human Ensembl ID and must still be present in the new String file.

Note: These files are big!

String file example (first 10 lines):

9606.ENSP00000000233    9606.ENSP00000263025    ptmod       f   f   150
9606.ENSP00000000233    9606.ENSP00000265709    reaction    f   f   908
9606.ENSP00000000233    9606.ENSP00000265709    catalysis   t   t   908
9606.ENSP00000000233    9606.ENSP00000263025    inhibition  inhibition  t   t   154
9606.ENSP00000000233    9606.ENSP00000265709    binding     f   t   908
9606.ENSP00000000233    9606.ENSP00000265709    catalysis   t   f   908
9606.ENSP00000000233    9606.ENSP00000263025    inhibition  inhibition  f   t   150
9606.ENSP00000000233    9606.ENSP00000263025    inhibition  inhibition  f   f   150
9606.ENSP00000000233    9606.ENSP00000265709    binding     f   f   908
9606.ENSP00000000233    9606.ENSP00000263025    catalysis   t   t   156

BioMart file (example made to work with above file):

Ensembl_Protein_ID  Gene_Symbol
ENSP00000265709 ANK1
ENSP00000000233 ARF5
ENSP00000263025 MAPK3
ENSP00000388118 NCSTN

Output file:

9606.ARF5   9606.MAPK3  ptmod       f   f   150
9606.ARF5   9606.ANK1   reaction    f   f   908
9606.ARF5   9606.ANK1   catalysis   t   t   908
9606.ARF5   9606.MAPK3  inhibition  inhibition  t   t   154
9606.ARF5   9606.ANK1   binding     f   t   908
9606.ARF5   9606.ANK1   catalysis   t   f   908
9606.ARF5   9606.MAPK3  inhibition  inhibition  f   t   150
9606.ARF5   9606.MAPK3  inhibition  inhibition  f   f   150
9606.ARF5   9606.ANK1   binding     f   f   908
9606.ARF5   9606.MAPK3  catalysis   t   t   156

I have no idea how to do this. I have tried using awk and perl but nothing works. I'm still a noob at the bioinformatics stuff. If anyone out there is willing to help this poor fellow, I would greatly appreciate it.

What did you try? Essentially, you need to make a hash table of the BioMart data, then substitute the hash keys in the "string" file with the hash values. — beasy, Feb 26 '18 at 17:48

score 0 · Answer 1 · answered Feb 26 '18 at 18:02

Sounds like all you need (assuming all values from String are present in BioMart like in your example) is:

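awk '
# First file (BioMartFile): build a lookup table, Ensembl protein ID -> gene symbol
NR==FNR{ map[$1]=$2; next }
# Second file (StringFile): rewrite the IDs in columns 1 and 2
{
    for (i=1; i<=2; i++) {
        split($i,f,/[.]/)           # f[1] = "9606", f[2] = Ensembl protein ID
        $i = f[1] "." map[f[2]]     # keep the "9606." prefix, swap in the symbol
    }
    print
}
' BioMartFile StringFile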
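
If some IDs in the String file might be missing from the BioMart table, or if the files are tab-separated with empty columns, a slightly more defensive variant may help. The following is only a sketch under assumptions: both files are assumed to be tab-delimited, the file names biomart_sample.tsv, string_sample.tsv and string_symbols.tsv are made up for the demonstration, and the second sample row uses a hypothetical ID (ENSP99999999999) with an empty fourth column to show the fallback.

```shell
# Sample inputs (assumed tab-separated). The first String row is taken from the
# question; the second row is hypothetical: its ID is not in the BioMart table
# and its fourth column is empty.
printf 'Ensembl_Protein_ID\tGene_Symbol\nENSP00000265709\tANK1\nENSP00000000233\tARF5\nENSP00000263025\tMAPK3\n' > biomart_sample.tsv
printf '9606.ENSP00000000233\t9606.ENSP00000263025\tinhibition\tinhibition\tt\tt\t154\n9606.ENSP00000000233\t9606.ENSP99999999999\tbinding\t\tf\tt\t908\n' > string_sample.tsv

awk '
BEGIN { FS = OFS = "\t" }             # split and re-join on tabs so empty columns survive
NR == FNR { map[$1] = $2; next }      # first file: Ensembl protein ID -> gene symbol
{
    for (i = 1; i <= 2; i++) {
        n = split($i, f, /[.]/)       # f[1] = "9606", f[2] = Ensembl protein ID
        if (n == 2 && (f[2] in map) && map[f[2]] != "")
            $i = f[1] "." map[f[2]]   # replace only when a symbol is known
    }
    print                             # unmapped IDs are printed unchanged
}
' biomart_sample.tsv string_sample.tsv > string_symbols.tsv
```

With the default field separator, assigning to $i rebuilds the line with single spaces, and an empty tab-delimited column would be lost during splitting; setting FS and OFS to a tab avoids both. Testing with `in` also leaves an unmapped ID as it was instead of reducing it to a bare "9606.".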
