1

I would like to download genbank files from NCBI using Biopython and a list of accession numbers (note that I call the script with an email address as an argument e.g., python scriptName.py emailAddress)

        import os
        import os.path
        import sys
        from Bio import Entrez
        import datetime
        import time

      ###############################################################################
        # Call Entrez to download files
        # If downloading more than 100 files...
        # Run this script only between 9pm-5am Monday - Friday EST
        # Send E-utilities requests to http://eutils.ncbi.nlm.nih.gov
        # Make no more than 3 requests every 1 second.
        # Use URL parameter email & tool for distributed software
        # NCBI's Disclaimer and Copyright notice must be evident to users of your service. 
        #
        # Use this script at your own risk. 
        # Neither the script author nor author's employers are responsible for consequences arising from improper usage 
        ###############################################################################
        start_day = datetime.date.today().weekday() # 0 is Monday, 6 is Sunday
        start_time = datetime.datetime.now().time()
        accession = ['NC_002695.1', 'NC_002128.1', 'NC_002127.1']
        arguments = sys.argv
        Entrez.email = arguments[1]

        if ((start_day < 5 and start_time > datetime.time(hour=21)) or (start_day < 5 and start_time < datetime.time(hour=5)) or start_day > 5 or len(accession) <= 100 ):
            for a in accession:
                print(a)
                if ( (datetime.date.today().weekday() < 5 and datetime.datetime.now().time() > datetime.time(hour=21)) or (datetime.date.today().weekday() < 5 and datetime.datetime.now().time() < datetime.time(hour=5)) or (datetime.date.today().weekday() == start_day + 1 and datetime.datetime.now().time() < datetime.time(hour=5)) or (datetime.date.today().weekday() > 5) or len(accession) <= 100 ):
                    while True:
                        print('Downloading ' + a)
                        #try:
                        handle=Entrez.efetch(db='nucleotide', id=a, rettype='gb', retmode='text') 
                        FILENAME =  'GenbankFiles_E_coli/' + a + '.gb'
                        local_file=open(FILENAME,'w')
                        local_file.write(handle.read())
                        handle.close()
                        local_file.close()

When I run the script, it does a partial download of the first file (only) before I get an error:

NC_002128.1
Downloading NC_002128.1
Downloading NC_002128.1
Traceback (most recent call last):
  File "/usr/local/anaconda3/lib/python3.4/urllib/request.py", line 1182, in do_open
    h.request(req.get_method(), req.selector, req.data, headers)
  File "/usr/local/anaconda3/lib/python3.4/http/client.py", line 1088, in request
    self._send_request(method, url, body, headers)
  File "/usr/local/anaconda3/lib/python3.4/http/client.py", line 1126, in _send_request
    self.endheaders(body)
  File "/usr/local/anaconda3/lib/python3.4/http/client.py", line 1084, in endheaders
    self._send_output(message_body)
  File "/usr/local/anaconda3/lib/python3.4/http/client.py", line 922, in _send_output
    self.send(msg)
  File "/usr/local/anaconda3/lib/python3.4/http/client.py", line 857, in send
    self.connect()
  File "/usr/local/anaconda3/lib/python3.4/http/client.py", line 834, in connect
    self.timeout, self.source_address)
  File "/usr/local/anaconda3/lib/python3.4/socket.py", line 494, in create_connection
    for res in getaddrinfo(host, port, 0, SOCK_STREAM):
  File "/usr/local/anaconda3/lib/python3.4/socket.py", line 533, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "download_files.py", line 92, in <module>
    handle=Entrez.efetch(db='nucleotide', id=a, rettype='gb', retmode='text') 
  File "/usr/local/anaconda3/lib/python3.4/site-packages/Bio/Entrez/__init__.py", line 155, in efetch
    return _open(cgi, variables, post)
  File "/usr/local/anaconda3/lib/python3.4/site-packages/Bio/Entrez/__init__.py", line 516, in _open
    handle = _urlopen(cgi)
  File "/usr/local/anaconda3/lib/python3.4/urllib/request.py", line 161, in urlopen
    return opener.open(url, data, timeout)
  File "/usr/local/anaconda3/lib/python3.4/urllib/request.py", line 463, in open
    response = self._open(req, data)
  File "/usr/local/anaconda3/lib/python3.4/urllib/request.py", line 481, in _open
    '_open', req)
  File "/usr/local/anaconda3/lib/python3.4/urllib/request.py", line 441, in _call_chain
    result = func(*args)
  File "/usr/local/anaconda3/lib/python3.4/urllib/request.py", line 1210, in http_open
    return self.do_open(http.client.HTTPConnection, req)
  File "/usr/local/anaconda3/lib/python3.4/urllib/request.py", line 1184, in do_open
    raise URLError(err)
urllib.error.URLError: <urlopen error [Errno -2] Name or service not known>

I have yet to determine if there is an error in my code, if there is an issue with the modules I selected (even though Biopython should handle the calls), if there is an issue with my connection (my job blocks and throttles without warning), or if it is something else.

I have tried running it with and without urllib*/http* modules to no avail (I get the same error). However, the partial file is interesting. Everything up to the final sequences is downloaded (with a contig entry at the end). Here are the last lines of the downloaded file:

    /translation="MVPPSAVLCYHNEISRQIPVNMKNIRTEFIPRFNLTLCFPRYWM
TWTGIGIICVFAMVPPALRDPLLGKLGMLVGRLGKSARQRALINLSLCFPEYSDKEKE
NIVDAMFATASMAVVLMAELALSGPDKISHRIRWNGLEIVEKMAQNNEKVIFLVPHAW
GVDIPAMLMAASGRKMAAMFHNQRNPVVDYVWNSVRRRFGGKLHARNDGIASFVRSVR
QGYWGYYLPDQDHGPEFSEFADFFATYKATLPVIGRLSRISGARIIPLFPVYDGKTHH
LTIHVSPPLAIRQKSDAHIARQINEVVENFVRPHPEQYTWILKLLKTRKEGEEDPY"
CONTIG      join(AB011549.2:1..92721) ###this line is not in the original file
//

Which can be compared to the original genbank file: http://www.ncbi.nlm.nih.gov/nuccore/10955266/?report=genbank

I can confirm that it is not an error related to DTD files but everything else is up for grabs. (The new RefSeq release from NCBI is compatible with Bio.Entrez.Parser?)

I am running this script on CentOS Python 3.4.3 :: Anaconda 2.3.0 (64-bit) :: Biopython 1.66

Community
  • 1
  • 1
cer
  • 1,961
  • 2
  • 17
  • 26
  • What version of Biopython are you using? – MattDMo Dec 01 '15 at 20:41
  • I believe it is biopython-1.65. I can't seem to find a way to do the programName -version for Biopython -- can you suggest something? – cer Dec 01 '15 at 20:53
  • `import Bio; print(Bio.__version__)` will tell you. You might want to upgrade, 1.66 is the latest version. I ran a modified version of your code, and I'm not getting any errors, but the Genbank files don't include the nucleotide sequences shown at nuccore, they end just like your example. – MattDMo Dec 01 '15 at 20:57
  • The output shows that I am using Biopython version 1.66 (thanks for the print trick). Can you please confirm that you were able to download all 3 files and possible share your modified code? – cer Dec 01 '15 at 21:15
  • 2
    Check out [this gist](https://gist.github.com/MattDMo/f710d92ef181c05d6073). I only included one `.gb` file, I can add the other two if you want. – MattDMo Dec 01 '15 at 23:53
  • MattdMo -- I really appreciate your efforts on this. I can rule out the code and I believe I am dealing with job-related network restrictions. Thank you again. – cer Dec 02 '15 at 16:03

1 Answers1

0

The NCBI Entrez fetch API distinguishes return types rettype="gb" and rettype="gbwithparts", the first can be shorter by giving you CONTIG lines referencing other records, while the later would expand these to give you the full sequence (look for "GenBank (full)" in the website).

You can sometimes get a glimpse of this when something is broken on the NCBI side, e.g. http://blastedbio.blogspot.co.uk/2012/03/missing-external-exons-in-genbank-with.html and http://blastedbio.blogspot.co.uk/2012/04/missing-feature-locations-in-genbank.html (fixed since).

I would guess that the socket error you are getting is related to your work internet access controls.

Peter Cock
  • 1,585
  • 1
  • 9
  • 14
  • access was indeed the problem. Not a pretty solution, but I download the files onto another server (outside of work) and then scp the files over to my work server. – cer Dec 23 '15 at 19:40
  • Thanks for confirming that. – Peter Cock Dec 24 '15 at 23:39