I am using STRING (string-db) interactions for a project, but the complete list of interactions identifies proteins by their Ensembl protein IDs. I would like to replace those Ensembl protein IDs with the corresponding HGNC-approved gene symbols. Using BioMart, I downloaded a table of Ensembl protein IDs and their respective gene symbols. So I need to look up every Ensembl ID in the STRING file (both the first and second columns contain Ensembl IDs) in my BioMart file, and replace each ID with the gene symbol listed for it there. An extra complication: each Ensembl ID in the STRING file is prefixed with "9606.", which is absent from the BioMart file. This number denotes a human Ensembl ID and needs to remain in the new STRING file.
Note: These files are big!
String file example (first 10 lines):
9606.ENSP00000000233 9606.ENSP00000263025 ptmod f f 150
9606.ENSP00000000233 9606.ENSP00000265709 reaction f f 908
9606.ENSP00000000233 9606.ENSP00000265709 catalysis t t 908
9606.ENSP00000000233 9606.ENSP00000263025 inhibition inhibition t t 154
9606.ENSP00000000233 9606.ENSP00000265709 binding f t 908
9606.ENSP00000000233 9606.ENSP00000265709 catalysis t f 908
9606.ENSP00000000233 9606.ENSP00000263025 inhibition inhibition f t 150
9606.ENSP00000000233 9606.ENSP00000263025 inhibition inhibition f f 150
9606.ENSP00000000233 9606.ENSP00000265709 binding f f 908
9606.ENSP00000000233 9606.ENSP00000263025 catalysis t t 156
BioMart file (example made to work with above file):
Ensembl_Protein_ID Gene_Symbol
ENSP00000265709 ANK1
ENSP00000000233 ARF5
ENSP00000263025 MAPK3
ENSP00000388118 NCSTN
Desired output file:
9606.ARF5 9606.MAPK3 ptmod f f 150
9606.ARF5 9606.ANK1 reaction f f 908
9606.ARF5 9606.ANK1 catalysis t t 908
9606.ARF5 9606.MAPK3 inhibition inhibition t t 154
9606.ARF5 9606.ANK1 binding f t 908
9606.ARF5 9606.ANK1 catalysis t f 908
9606.ARF5 9606.MAPK3 inhibition inhibition f t 150
9606.ARF5 9606.MAPK3 inhibition inhibition f f 150
9606.ARF5 9606.ANK1 binding f f 908
9606.ARF5 9606.MAPK3 catalysis t t 156
I have no idea how to do this. I have tried using awk and Perl, but nothing works. I'm still a noob at the bioinformatics stuff. If anyone out there is willing to help this poor fellow, I would greatly appreciate it.
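For reference, the lookup-and-replace described above could be sketched in Python roughly as follows. This is only a minimal sketch under some assumptions: the BioMart file is whitespace- or tab-separated with the Ensembl protein ID in the first column and the gene symbol in the second, every ID in the STRING file has the form of digits, a dot, then "ENSP" and digits, and IDs with no match in the BioMart file are left unchanged. The function names (load_mapping, replace_ids) are made up for illustration.

```python
import re
import sys

# Matches tokens such as "9606.ENSP00000000233":
# group 1 is the taxon prefix ("9606."), group 2 is the Ensembl protein ID.
ID_PATTERN = re.compile(r"(\d+\.)(ENSP\d+)")


def load_mapping(lines):
    """Build {Ensembl protein ID: gene symbol} from the BioMart table."""
    mapping = {}
    for line in lines:
        fields = line.split()
        # Skip the header, blank lines, and proteins without a gene symbol.
        if len(fields) >= 2 and fields[0].startswith("ENSP"):
            mapping[fields[0]] = fields[1]
    return mapping


def replace_ids(line, mapping):
    """Swap each prefixed Ensembl ID for its gene symbol, keeping the prefix."""
    def swap(match):
        prefix, protein_id = match.groups()
        # Assumption: an ID missing from the table is kept as it is.
        return prefix + mapping.get(protein_id, protein_id)
    return ID_PATTERN.sub(swap, line)


def main(biomart_path, string_path, output_path):
    with open(biomart_path) as handle:
        mapping = load_mapping(handle)
    # Stream the large STRING file one line at a time.
    with open(string_path) as src, open(output_path, "w") as dst:
        for line in src:
            dst.write(replace_ids(line, mapping))


if __name__ == "__main__":
    if len(sys.argv) == 4:
        main(*sys.argv[1:4])
```

It could be run as, for example, "python replace_ids.py biomart.txt string_actions.txt output.txt" (all four file names are placeholders). Only the BioMart table is held in memory as a dictionary; the STRING file is read and written line by line, so its size should not be a problem. Because the substitution works on the ID tokens rather than on split columns, the rest of each line, including its original separators, is written back untouched.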