I have a list of UniProt IDs that I would like to convert to NCBI (formerly Entrez) gene IDs. The code runs without errors, but many values are missing, even though the information is present on both the NCBI Gene pages and the UniProt pages. Can anyone explain this, or help me get matches for all of the proteins I provide?
The code I have is as follows:
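library("tidyverse")
library("biomaRt")  # provides useMart() and getBM()
# UniProt accessions to convert
proteins <- c("P02680", "Q63041", "Q5EBC0", "P02770")
# Connect to the Ensembl rat gene dataset
mart <- useMart("ensembl", dataset = "rnorvegicus_gene_ensembl")
# Columns to return; the query filters on Swiss-Prot accessions
attributes <- c("entrezgene_id", "entrezgene_accession", "uniprotswissprot")
protein_to_gene <- getBM(attributes = attributes, filters = "uniprotswissprot", values = proteins, mart = mart)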
The output is a data frame with three columns (entrezgene_id, entrezgene_accession and uniprotswissprot), exactly as I expect.
Unexpectedly, though, it has only two rows:
entrezgene_id | entrezgene_accession | uniprotswissprot
24367 | Fgg | P02680
252922 | Pzp | Q63041
The two queries with no output (Q5EBC0 and P02770 in this example) do have corresponding gene IDs listed on UniProt and in the NCBI Gene database. For example, for albumin, https://www.uniprot.org/uniprotkb/P02770/ gives the gene ID as 24186, and https://www.ncbi.nlm.nih.gov/gene/?term=24186 shows the UniProt ID as P02770.
My actual search list is considerably longer, at about 350 proteins, and only about one third of them return a hit. At least some, if not all, of the proteins without a hit have matches in both the UniProt and NCBI Gene databases, as shown above for albumin.
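For reference, this is roughly how I work out which IDs came back empty and what the hit rate is. It is only a minimal sketch: the data frame is rebuilt by hand from the two rows shown above so that it runs without a connection to Ensembl, and the names `unmatched` and `hit_rate` are just ones I made up for illustration.

```r
# Query IDs from the example above
proteins <- c("P02680", "Q63041", "Q5EBC0", "P02770")

# The getBM() result shown above, rebuilt by hand for illustration
# (in practice this would be the object returned by getBM())
protein_to_gene <- data.frame(
  entrezgene_id        = c(24367L, 252922L),
  entrezgene_accession = c("Fgg", "Pzp"),
  uniprotswissprot     = c("P02680", "Q63041"),
  stringsAsFactors     = FALSE
)

# IDs that were queried but are absent from the result
unmatched <- setdiff(proteins, protein_to_gene$uniprotswissprot)

# Fraction of queried IDs that returned at least one row
hit_rate <- mean(proteins %in% protein_to_gene$uniprotswissprot)

print(unmatched)
print(hit_rate)
```

On the full list of about 350 proteins, the same two lines give me the set of accessions I then check by hand on UniProt and NCBI Gene.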
Does anybody have an explanation for what is happening here, or why my search results are so full of holes? Alternatively, is there another way to do the conversion reliably?
Thanks!