I have a list of UniProt IDs, each with a corresponding residue of interest (e.g. Q7TQ48_S442). I need to retrieve the residues from 6 before to 6 after that site within the protein sequence (in the example, the sequence I need would be DIEAEASEERQQE). Can you suggest a method to do this for a whole list of IDs and residues of interest using Python, R, or an existing web tool? Thanks, Emanuele
-
I think this question would fit well in the bioinformatics group. Is there a way to move it there? – Melissa Key Apr 14 '18 at 17:13
-
Thanks Melissa. This is my first post and am not sure how to do it... – Emanuele Loro Apr 15 '18 at 13:19
1 Answer
If I submit a list of protein IDs to UniProt at https://www.uniprot.org/uploadlists/, either by pasting it in or by uploading a file, I get a table of results. At the top of the table, there is an option that allows you to select the columns; one of the available columns is the peptide sequence. (No programming is needed so far: just upload the list of UniProt IDs you are interested in.)
Now, to extract the specific sequence, this can be done in R using the substr command. Here, we'd want to add/subtract 6 from either end:
len13seq <- with(uniprot_data, substr(peptide_sequence, start = ind - 6, stop = ind + 6))
where in your example, ind = 442.
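Since the question also mentions Python, here is a minimal sketch of the same windowing step in Python. The function name flanking_window and the toy sequence are made up for illustration (the toy is not the real Q7TQ48 sequence). Positions are treated as 1-based, as in UniProt, and the window is simply shortened if the site is within 6 residues of either end of the protein.

```python
def flanking_window(sequence, position, flank=6):
    """Return residues position-flank .. position+flank (1-based, inclusive)."""
    if not 1 <= position <= len(sequence):
        raise ValueError(
            f"position {position} is outside the sequence (length {len(sequence)})"
        )
    # Convert the 1-based start to a 0-based slice index; clamp at the N-terminus.
    start = max(position - flank - 1, 0)
    # Slice ends are exclusive, so position + flank already includes the last residue;
    # clamp at the C-terminus.
    end = min(position + flank, len(sequence))
    return sequence[start:end]


# Toy sequence (hypothetical): the example window padded on both sides,
# so that the serine of interest sits at position 9.
toy = "MK" + "DIEAEASEERQQE" + "LL"
print(flanking_window(toy, 9))
```

For a real protein you would pass the full sequence from the UniProt table and the site index from your tag (442 in the example).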
To make this work you need to:
- Separate your tags into two (or more) columns: the UniProt ID and the site index. You can also keep the amino acid letter in its own column if you need it for later analyses.
- Create a file containing just the UniProt IDs and submit it to the UniProt database.
- Customize the displayed columns, making sure to include the sequence.
- Download the result and read it into R.
- Merge the original data frame (with the site index) with the downloaded results.
- Generate the sequence in the neighborhood around your point of interest.
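The steps above could be sketched in Python roughly as follows, using only the standard library. This assumes the downloaded table is tab-separated with columns named Entry and Sequence; check the header of your own download, since I have not verified those names against the current export. The helper names and the TOY001/TOY002 accessions are hypothetical. As a sanity check, the sketch also confirms that the residue at the given position matches the letter in the tag, and returns None when it does not or when the ID is missing.

```python
import csv
import io


def parse_tag(tag):
    """Split a tag such as 'Q7TQ48_S442' into ('Q7TQ48', 'S', 442)."""
    accession, site = tag.rsplit("_", 1)
    return accession, site[0], int(site[1:])


def load_sequences(tsv_text, id_col="Entry", seq_col="Sequence"):
    """Map accession -> sequence from tab-separated text (column names are assumptions)."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return {row[id_col]: row[seq_col] for row in reader}


def windows_for_tags(tags, sequences, flank=6):
    """Return {tag: window}, or None for a tag that cannot be resolved."""
    results = {}
    for tag in tags:
        accession, residue, pos = parse_tag(tag)
        seq = sequences.get(accession)
        # Skip unknown IDs, out-of-range sites, and residue mismatches.
        if seq is None or not 1 <= pos <= len(seq) or seq[pos - 1] != residue:
            results[tag] = None
            continue
        # 1-based inclusive window; Python clamps a slice end past the sequence end.
        results[tag] = seq[max(pos - flank - 1, 0): pos + flank]
    return results


# Toy data standing in for the downloaded table (not real UniProt entries).
toy_tsv = "Entry\tSequence\nTOY001\tMKDIEAEASEERQQELL\n"
toy_tags = ["TOY001_S9", "TOY001_A9", "TOY002_S3"]
for tag, window in windows_for_tags(toy_tags, load_sequences(toy_tsv)).items():
    print(tag, window)
```

With a real download, you would read the file's contents into tsv_text and pass your full list of tags. Tags that come back as None are worth checking by hand, since they may point to an outdated accession or a different isoform numbering.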
It is possible to do this entirely within R. I did that at one point, but I'm not sure it's worth it unless you need the entire thing to be automated. If that's what you need, I would suggest checking out https://www.bioconductor.org/packages/3.7/bioc/html/UniProt.ws.html. I don't use Bioconductor often, so I'm not familiar with the package. When I previously used R to get UniProt data, what I was after was not available in the tabular output, and I had to modify my code quite a bit to get to the data I was after. Hopefully, the Bioconductor solution is easier than what I did.
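If you would rather automate the download step in Python, a rough sketch is below. It is not the Bioconductor approach, and it assumes that UniProt serves a FASTA record for an accession at https://rest.uniprot.org/uniprotkb/<accession>.fasta; please confirm that against the current UniProt documentation before relying on it. The function names are my own. Only the FASTA parsing is exercised here, on a made-up record, so nothing is fetched when the snippet runs.

```python
import urllib.request


def fasta_to_sequence(fasta_text):
    """Join the sequence lines of a single-record FASTA string, dropping the header."""
    lines = fasta_text.strip().splitlines()
    return "".join(line.strip() for line in lines if not line.startswith(">"))


def fetch_sequence(accession, timeout=30):
    """Download one sequence from UniProt (the URL pattern is an assumption)."""
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return fasta_to_sequence(response.read().decode("utf-8"))


# Parsing demo on a made-up FASTA record (no network access needed).
toy_fasta = ">toy|TOY001|hypothetical protein\nMKDIEAEA\nSEERQQELL\n"
print(fasta_to_sequence(toy_fasta))
```

For the example in the question you would call fetch_sequence("Q7TQ48") and then take the residues from position 436 to 448 (442 minus and plus 6). For a long list of IDs, the batch upload described above is kinder to the server than one request per protein.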
