I downloaded the UniProt files for a group of proteins (n>1000, so checking them manually is not an option). The complete data files come as either a flat text file or an XML file, and they contain a lot of information (for an example, see http://www.uniprot.org/uniprot/?query=organism%3A%22homo+sapiens%22, then go to download, where you can look at the complete data for the first 10 entries as either a txt or an xml file).

Since most of that information is not relevant to me, I need a way to select only the fields I'm interested in (preferably as a data matrix). For every entry, these are:

Wanted information:                 Text file entry:    XML file entry:
Uniprot ID                          ID                  <entry><name>
Gene Name                           GN                  <gene><name type="primary">
Full protein name                   RecName:            <protein><recommendedName><fullName>
Transmembrane domains (may be more) TRANSMEM            <feature type="transmembrane region"><location> This consists of <begin position="xxx"/> and <end position="yyy"/>
Full protein sequence               SQ                  <sequence>

Some entries will not contain all of this information (transmembrane domains, for example), in which case NA can be filled in. Some entries will contain the same kind of information more than once (again, transmembrane domains), and in that case all occurrences should be listed (if possible in the same cell, separated by "," or ";" or "|").

I am a bit familiar with R, but I wasn't able to get this far with it (possibly due to a lack of programming skills). I looked into XML editors, since that seemed to be the easiest solution, but I couldn't get any of them to work, and I couldn't find anything that explained the different steps. I also know that there should be a way to process XML files in R, but the help files didn't get me where I need to be either. XMLQuire, the only editor I have managed to download so far, lets me view the file but keeps crashing whenever I try to do anything (even when I'm just trying to figure out where I can edit the file), so my file might be too large, or there is some other problem.

Help with this would be highly appreciated. I'm hoping to find someone who has done something similar, but all solutions are welcome, no matter how small and no matter which program I need to use, as long as it's freeware.

Also, let me know if anything is unclear; I have tried to be as clear as possible. And sorry for being such a blondie on the subject.

  • I am a little bit confused. Is the XML part of the question what you are trying to read, or a result? I can't find it in your link. – agstudy Jan 11 '13 at 09:24
  • I wasn't able to get any proper results (the XML reader crashes all the time), so there are no results in my post. It's mainly there to explain what I start with, what I want to end up with, and what each field is called in either the txt or the xml version of the file. I don't mind what I need to use to get to my desired data matrix, as long as I get there. I just have the feeling that XML might be the way to go, although I don't know how. – user1941884 Jan 11 '13 at 09:28
  • Perhaps the [UniProt.ws](http://bioconductor.org/packages/2.12/bioc/html/UniProt.ws.html) package already does what you want? It requires R-devel. – Martin Morgan Jan 11 '13 at 14:16
  • Have you looked into the BioPython packages? There are several ready-to-use parsers, and I suspect there is one for UniProt files. – fridaymeetssunday Jan 11 '13 at 15:13

1 Answer

As I mentioned in my comment, if you know (Bio)Python or are willing to try it, there is a library that parses the files you've retrieved, Bio.SeqIO. Quoting the Biopython tutorial:

Let’s suppose you have downloaded the whole of UniProt in the plain text SwissProt file format from their FTP site (ftp://ftp.uniprot.org/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot.dat.gz) and uncompressed it as the file uniprot_sprot.dat, and you want to extract just a few records from it:

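from Bio import SeqIO

# Index the file by accession without loading every record into memory.
uniprot = SeqIO.index("uniprot_sprot.dat", "swiss")
# get_raw() returns bytes under Python 3, so the output file is opened in binary mode.
with open("selected.dat", "wb") as handle:
    for acc in ["P33487", "P19801", "P13689", "Q8JZQ5", "Q9TRC7"]:
        handle.write(uniprot.get_raw(acc))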

There is a longer example in Section 16.1.5 using the SeqIO.index() function to sort a large sequence file (without loading everything into memory at once).

5.4.2.2 Getting the raw data for a record

You can also have a look at the tutorial section on opening sequence files. A simpler problem, but one similar to yours, has been answered on Biostar, and it is probably what you need: "Parsing Swiss-Prot files".

Basically, you can extract the records from your files, store them in Python objects and manipulate them at will, for instance to retrieve the IDs.
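If installing Biopython is a hurdle, the XML download can also be handled with the Python standard library alone. What follows is a minimal sketch and not a tested pipeline: it assumes the element layout listed in the question (<entry><name>, <gene><name type="primary">, and so on), the sample data and the helper names (local, child, entry_to_row, parse_uniprot_xml) are made up for illustration, and the namespace prefix is simply stripped so the code does not depend on its exact value.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Tiny invented sample in the layout described in the question (not real UniProt data).
SAMPLE_XML = """<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <name>DEMO1_HUMAN</name>
    <protein><recommendedName><fullName>Demo protein 1</fullName></recommendedName></protein>
    <gene><name type="primary">DEMO1</name></gene>
    <feature type="transmembrane region">
      <location><begin position="5"/><end position="25"/></location>
    </feature>
    <feature type="transmembrane region">
      <location><begin position="40"/><end position="60"/></location>
    </feature>
    <sequence>MKT
AAA</sequence>
  </entry>
  <entry>
    <name>DEMO2_HUMAN</name>
    <sequence>MSS</sequence>
  </entry>
</uniprot>"""


def local(tag):
    """Drop the '{namespace}' prefix that ElementTree puts in front of tag names."""
    return tag.rsplit("}", 1)[-1]


def child(elem, name, **attrs):
    """Return the first direct child with this tag name and attributes, or None."""
    if elem is None:
        return None
    for c in elem:
        if local(c.tag) == name and all(c.get(k) == v for k, v in attrs.items()):
            return c
    return None


def text_or_na(elem):
    """Stripped element text, or 'NA' when the element is missing or empty."""
    return elem.text.strip() if elem is not None and elem.text else "NA"


def entry_to_row(entry):
    """Collect the five wanted fields from one <entry> element."""
    spans = []
    for feat in entry:
        if local(feat.tag) == "feature" and feat.get("type") == "transmembrane region":
            loc = child(feat, "location")
            begin, end = child(loc, "begin"), child(loc, "end")
            if begin is not None and end is not None:
                spans.append(f"{begin.get('position')}-{end.get('position')}")
    seq = child(entry, "sequence")
    return {
        "uniprot_id": text_or_na(child(entry, "name")),
        "gene_name": text_or_na(child(child(entry, "gene"), "name", type="primary")),
        "full_name": text_or_na(
            child(child(child(entry, "protein"), "recommendedName"), "fullName")
        ),
        # All transmembrane regions go into one cell, separated by ';'.
        "transmembrane": ";".join(spans) if spans else "NA",
        # Remove line breaks and spaces inside the sequence text.
        "sequence": "".join(seq.text.split()) if seq is not None and seq.text else "NA",
    }


def parse_uniprot_xml(source):
    """Stream through the file entry by entry instead of loading it all at once."""
    rows = []
    for _, elem in ET.iterparse(source, events=("end",)):
        if local(elem.tag) == "entry":
            rows.append(entry_to_row(elem))
            elem.clear()  # release the children of the finished entry
    return rows


rows = parse_uniprot_xml(io.BytesIO(SAMPLE_XML.encode("utf-8")))

# Write the rows as CSV text; pass a real file handle instead to save to disk.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=list(rows[0]))
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

For the real download, pass the path of the XML file to parse_uniprot_xml() instead of the in-memory sample. The resulting CSV can then be read into R with read.csv() if you prefer to continue there.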

My answer is quite vague, but it should point you in the right direction. I hope it helps.

fridaymeetssunday