
I made a Python script for downloading protein sequences from UniProt in FASTA format. The script reads accession numbers from a text file (one accession number per line) and then tries to download the corresponding sequence from the UniProt database. Here is the script:

import requests

# Read the accession IDs (one per line).
with open('testfasta.txt', 'r') as infile:
    lines = infile.readlines()

count = 0
for line in lines:
    count += 1
    access_id = line.strip()

    # Build the UniProt REST URL for this accession.
    url_part1 = 'https://rest.uniprot.org/uniprotkb/'
    url_part2 = '.fasta'
    URL = url_part1 + access_id + url_part2

    response = requests.get(URL)

    # Write each sequence to its own file, named after the accession.
    with open(access_id + ".fa", "wb") as txtFile:
        txtFile.write(response.content)

print("Total sequences downloaded =", count)

This works fine, but for hundreds of sequences it generates a large number of files. It would be better to write each incoming sequence below the previous one, so that everything ends up in a single file. The FASTA format is basically plain text in which each record starts with a header line marked with ">", followed by the sequence, e.g.

>firstseq_header
djsfkasdjfkasjdfkasjdflkasjdflkasjdfkasdjfsadk
iewurpwierpofasiodfjlkasdfklasjowieqrudsafdsaf
>secseq_header
dsfjsdfkjasfasdfhwrwerewrasdfasrwerasdfa
awerwerasafas
>nseq_header
ajskdfhjasdfhlasjdhfwueroywieuhsjadfh
hdsfkjh

and so on
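
In other words, I want every newly fetched record written after the previous one in a single output file, along the lines of this rough sketch (results.fasta is just a placeholder name, and the accession IDs are examples):

import requests

accession_ids = ["P06213", "P14735", "P01308"]  # example UniProt IDs

# Keep one output file open and write each new record after the last.
with open("results.fasta", "w") as outfile:
    for access_id in accession_ids:
        response = requests.get(f"https://rest.uniprot.org/uniprotkb/{access_id}.fasta")
        outfile.write(response.text)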

  • We can't know what's in your text file, so this isn't really a [minimal reproducible example]. I tested my answer with `infile = ["P06213", "P14735", "P01308"]` based on an example on the web site. – tripleee Aug 18 '23 at 11:00
  • Thanks for your answer. Sorry for being unclear. The text file is a list of accession IDs (with each accession ID written on one line). The script reads and fetches the respective fasta sequence from Uniprot. In the meantime, I partially solved the problem by using .append. – Irfan Aug 21 '23 at 08:55

1 Answer


Something like this? Just write them all to the same file.

import requests

count = 0  # so the final print works even if the input file is empty
with open('testfasta.txt', 'r') as infile, \
     open('results.fasta', 'w') as outfile:
    for count, line in enumerate(infile, 1):
        access_id = line.strip()
        response = requests.get(
            f'https://rest.uniprot.org/uniprotkb/{access_id}.fasta')
        # check that the fetch succeeded; raise an error if not
        response.raise_for_status()
        assert response.text.startswith('>')
        assert response.text.endswith('\n')
        outfile.write(response.text)

print(f"Total sequences downloaded = {count}")

This assumes that the data you fetch is newline-terminated, and includes the FASTA header before the sequence itself. If that's not necessarily always true, maybe replace the asserts with code to fix any such problems. I also made various changes to make it more idiomatic.
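
For example, a sketch of that repair logic, as a drop-in replacement for the two assert lines inside the loop above (response, access_id, and outfile are the names already used there):

text = response.text
# Skip and report responses that don't look like a FASTA record.
if not text.startswith('>'):
    print(f"Skipping {access_id}: response does not look like FASTA")
    continue
# Make sure the record is newline-terminated so records don't run together.
if not text.endswith('\n'):
    text += '\n'
outfile.write(text)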

One complication is that the response.content you download is bytes, not text. You could decode it yourself if you wanted to, but of course Requests already does this for you and provides the result in response.text.
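
To see the difference concretely (a toy example using one of the IDs from the comments):

import requests

response = requests.get('https://rest.uniprot.org/uniprotkb/P01308.fasta')
print(type(response.content))  # <class 'bytes'>: the raw payload
print(type(response.text))     # <class 'str'>: decoded using response.encoding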

  • The sequence download works. Next, I would like to print a notice if I put in a wrong ID and it doesn't fetch a proper sequence. I will try that and if it doesn't work, I'll make a new post. – Irfan Aug 22 '23 at 01:26
  • (This already generates a traceback if you put an invalid ID which causes the fetch to fail.) – tripleee Aug 22 '23 at 04:00
  • Thanks again. Yes, it does give the error when I put an invalid ID. So far, I haven't figured out how to count such instances and get a correct count of how many sequences were downloaded and how many failed, with their respective IDs, so I know which ones I need to find elsewhere. – Irfan Aug 22 '23 at 09:07
  • Then don't raise an exception, just add the failed one to a list, and then print the list of failed IDs to the terminal or save it to a file. In brief, if the return code in the response is 4xx or 5xx, it failed. See also e.g. https://stackoverflow.com/questions/61463224/when-to-use-raise-for-status-vs-status-code-testing – tripleee Aug 22 '23 at 09:28
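
For completeness, a sketch of what that last comment describes, reusing the loop from the answer but collecting failed IDs instead of raising:

import requests

failed = []
count = 0
with open('testfasta.txt', 'r') as infile, \
     open('results.fasta', 'w') as outfile:
    for line in infile:
        access_id = line.strip()
        response = requests.get(
            f'https://rest.uniprot.org/uniprotkb/{access_id}.fasta')
        if not response.ok:  # ok is True only for status codes below 400
            failed.append(access_id)
            continue
        outfile.write(response.text)
        count += 1

print(f"Total sequences downloaded = {count}")
if failed:
    print("Failed IDs:", ", ".join(failed))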