I would like to download genbank files from NCBI using Biopython and a list of accession numbers (note that I call the script with an email address as an argument e.g., python scriptName.py emailAddress)
import os
import os.path
import sys
from Bio import Entrez
import datetime
import time
###############################################################################
# Call Entrez to download files
# If downloading more than 100 files...
# Run this script only between 9pm-5am Monday - Friday EST
# Send E-utilities requests to http://eutils.ncbi.nlm.nih.gov
# Make no more than 3 requests every 1 second.
# Use URL parameter email & tool for distributed software
# NCBI's Disclaimer and Copyright notice must be evident to users of your service.
#
# Use this script at your own risk.
# Neither the script author nor author's employers are responsible for consequences arising from improper usage
###############################################################################
start_day = datetime.date.today().weekday() # 0 is Monday, 6 is Sunday
start_time = datetime.datetime.now().time()
accession = ['NC_002695.1', 'NC_002128.1', 'NC_002127.1']
arguments = sys.argv
Entrez.email = arguments[1]
if ((start_day < 5 and start_time > datetime.time(hour=21)) or (start_day < 5 and start_time < datetime.time(hour=5)) or start_day > 5 or len(accession) <= 100 ):
for a in accession:
print(a)
if ( (datetime.date.today().weekday() < 5 and datetime.datetime.now().time() > datetime.time(hour=21)) or (datetime.date.today().weekday() < 5 and datetime.datetime.now().time() < datetime.time(hour=5)) or (datetime.date.today().weekday() == start_day + 1 and datetime.datetime.now().time() < datetime.time(hour=5)) or (datetime.date.today().weekday() > 5) or len(accession) <= 100 ):
while True:
print('Downloading ' + a)
#try:
handle=Entrez.efetch(db='nucleotide', id=a, rettype='gb', retmode='text')
FILENAME = 'GenbankFiles_E_coli/' + a + '.gb'
local_file=open(FILENAME,'w')
local_file.write(handle.read())
handle.close()
local_file.close()
When I run the script, it does a partial download of the first file (only) before I get an error:
NC_002128.1
Downloading NC_002128.1
Downloading NC_002128.1
Traceback (most recent call last):
File "/usr/local/anaconda3/lib/python3.4/urllib/request.py", line 1182, in do_open
h.request(req.get_method(), req.selector, req.data, headers)
File "/usr/local/anaconda3/lib/python3.4/http/client.py", line 1088, in request
self._send_request(method, url, body, headers)
File "/usr/local/anaconda3/lib/python3.4/http/client.py", line 1126, in _send_request
self.endheaders(body)
File "/usr/local/anaconda3/lib/python3.4/http/client.py", line 1084, in endheaders
self._send_output(message_body)
File "/usr/local/anaconda3/lib/python3.4/http/client.py", line 922, in _send_output
self.send(msg)
File "/usr/local/anaconda3/lib/python3.4/http/client.py", line 857, in send
self.connect()
File "/usr/local/anaconda3/lib/python3.4/http/client.py", line 834, in connect
self.timeout, self.source_address)
File "/usr/local/anaconda3/lib/python3.4/socket.py", line 494, in create_connection
for res in getaddrinfo(host, port, 0, SOCK_STREAM):
File "/usr/local/anaconda3/lib/python3.4/socket.py", line 533, in getaddrinfo
for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "download_files.py", line 92, in <module>
handle=Entrez.efetch(db='nucleotide', id=a, rettype='gb', retmode='text')
File "/usr/local/anaconda3/lib/python3.4/site-packages/Bio/Entrez/__init__.py", line 155, in efetch
return _open(cgi, variables, post)
File "/usr/local/anaconda3/lib/python3.4/site-packages/Bio/Entrez/__init__.py", line 516, in _open
handle = _urlopen(cgi)
File "/usr/local/anaconda3/lib/python3.4/urllib/request.py", line 161, in urlopen
return opener.open(url, data, timeout)
File "/usr/local/anaconda3/lib/python3.4/urllib/request.py", line 463, in open
response = self._open(req, data)
File "/usr/local/anaconda3/lib/python3.4/urllib/request.py", line 481, in _open
'_open', req)
File "/usr/local/anaconda3/lib/python3.4/urllib/request.py", line 441, in _call_chain
result = func(*args)
File "/usr/local/anaconda3/lib/python3.4/urllib/request.py", line 1210, in http_open
return self.do_open(http.client.HTTPConnection, req)
File "/usr/local/anaconda3/lib/python3.4/urllib/request.py", line 1184, in do_open
raise URLError(err)
urllib.error.URLError: <urlopen error [Errno -2] Name or service not known>
I have yet to determine if there is an error in my code, if there is an issue with the modules I selected (even though Biopython should handle the calls), if there is an issue with my connection (my job blocks and throttles without warning), or if it is something else.
I have tried running it with and without urllib*/http* modules to no avail (I get the same error). However, the partial file is interesting. Everything up to the final sequences is downloaded (with a contig entry at the end). Here are the last lines of the downloaded file:
/translation="MVPPSAVLCYHNEISRQIPVNMKNIRTEFIPRFNLTLCFPRYWM
TWTGIGIICVFAMVPPALRDPLLGKLGMLVGRLGKSARQRALINLSLCFPEYSDKEKE
NIVDAMFATASMAVVLMAELALSGPDKISHRIRWNGLEIVEKMAQNNEKVIFLVPHAW
GVDIPAMLMAASGRKMAAMFHNQRNPVVDYVWNSVRRRFGGKLHARNDGIASFVRSVR
QGYWGYYLPDQDHGPEFSEFADFFATYKATLPVIGRLSRISGARIIPLFPVYDGKTHH
LTIHVSPPLAIRQKSDAHIARQINEVVENFVRPHPEQYTWILKLLKTRKEGEEDPY"
CONTIG join(AB011549.2:1..92721) ###this line is not in the original file
//
Which can be compared to the original genbank file: http://www.ncbi.nlm.nih.gov/nuccore/10955266/?report=genbank
I can confirm that it is not an error related to DTD files but everything else is up for grabs. (The new RefSeq release from NCBI is compatible with Bio.Entrez.Parser?)
I am running this script on CentOS Python 3.4.3 :: Anaconda 2.3.0 (64-bit) :: Biopython 1.66