Extracting Fasta Moonlight Protein Sequences with Python

Question

I want to extract the FASTA files containing the amino acid sequences from the Moonlighting Protein Database (www.moonlightingproteins.org/results.php?search_text=) with Python. It's a repetitive process, and I'd rather learn how to program it than do it manually; come on, we're in 2016. The problem is that I don't know how to write the code, because I'm a rookie programmer :(. The basic pseudocode would be:

 for protein_name in site: www.moonlightingproteins.org/results.php?search_text=:
     go to the UniProt option
     download the FASTA file
     store it in a .txt file inside a given folder

Thanks in advance!

I suggest googling 'web scraping with python intro' or similar terms and messing around with that a bit. Right now your question is a bit too abstract. — Swier, Sep 20 '16 at 19:18

score 0 · Answer 1 · edited Jun 20 '20 at 09:12

I would strongly suggest asking the authors for the database. From the FAQ:

I would like to use the MoonProt database in a project to analyze the amino acid sequences or structures using bioinformatics.

Please contact us at bioinformatics@moonlightingproteins.org if you are interested in using MoonProt database for analysis of sequences and/or structures of moonlighting proteins.

Assuming you find something interesting, how are you going to cite it in your paper or your thesis? "The sequences were scraped from a public webpage without the consent of the authors". Much better to give credit to the original researchers.

That's a good introduction to scraping

But back to your original question.

import requests
from lxml import html

# Download one protein at a time; change 3 to any other id.
page = requests.get('http://www.moonlightingproteins.org/detail.php?id=3')
# Convert the HTML document into a tree we can query from Python.
tree = html.fromstring(page.content)
# Get all table cells.
cells = tree.xpath('//td')

for i, cell in enumerate(cells):
    # If the cell looks like a FASTA record, print it.
    if cell.text and cell.text.startswith('>'):
        print(cell.text)
    # If the cell mentions UniProt, print the link from the next cell
    # (the bounds check avoids an IndexError on the last cell).
    if 'UniProt' in cell.text_content() and i + 1 < len(cells):
        link = cells[i + 1].find('a')
        if link is not None and 'href' in link.attrib:
            print(link.attrib['href'])
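To cover the remaining steps of your pseudocode (looping over several proteins and storing each sequence in a .txt file inside a folder), here is a rough sketch built on the snippet above. The names `fasta_filename`, `save_fasta` and `download_range` are made up for illustration. It assumes that the detail pages are numbered consecutively through the `id` parameter and that `cell.text` holds the whole FASTA record, neither of which I have verified for every entry.

```python
import re
import time
from pathlib import Path

DETAIL_URL = 'http://www.moonlightingproteins.org/detail.php?id={}'


def fasta_filename(fasta_text, fallback):
    """Build a safe .txt file name from the first token of the FASTA header."""
    header = fasta_text.strip().splitlines()[0]
    tokens = header[1:].split()  # drop the leading '>'
    token = tokens[0] if tokens else fallback
    # Replace characters that are awkward in file names (e.g. '|').
    return re.sub(r'[^A-Za-z0-9._-]', '_', token) + '.txt'


def save_fasta(fasta_text, folder, fallback='sequence'):
    """Write one FASTA record into folder and return the path written."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / fasta_filename(fasta_text, fallback)
    path.write_text(fasta_text.strip() + '\n')
    return path


def download_range(first_id, last_id, folder):
    """Fetch detail pages first_id..last_id and save every FASTA cell found.

    Assumption: ids are consecutive integers, as in detail.php?id=3 above.
    """
    # Imported here so the helpers above work without third-party packages.
    import requests
    from lxml import html

    for protein_id in range(first_id, last_id + 1):
        page = requests.get(DETAIL_URL.format(protein_id), timeout=30)
        if page.status_code != 200 or not page.content.strip():
            continue  # skip missing or empty pages
        tree = html.fromstring(page.content)
        for cell in tree.xpath('//td'):
            if cell.text and cell.text.strip().startswith('>'):
                save_fasta(cell.text, folder, fallback=f'protein_{protein_id}')
        time.sleep(1)  # pause between requests to go easy on the server
```

Calling `download_range(1, 10, 'moonprot_fasta')` would then try the first ten detail pages. Two records with the same header token overwrite each other, so adjust the naming if that matters for your data.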
