Alternative to Bio.Entrez EFetch for downloading full genome sequences from NCBI

Question

My goal is to download full metazoan genome sequences from NCBI. I have a list of unique ID numbers for the genome sequences I need. I planned to use the Bio.Entrez module EFetch to download the data but learned today via the Nov 2, 2011 release notes (http://1.usa.gov/1TA5osg) that EFetch does not support the 'Genome' database. Can anyone suggest an alternative package/module or some other way around this? Thank you in advance!

If it is only one genome, why not manually download it from the FTP server ftp://ftp.ncbi.nih.gov/genomes/ ? — Maximilian Peters, May 26 '16 at 11:22
Thanks for the link! I need to download ~300 genome sequences so I would prefer not to do that manually. — Melissa DeBiasse, May 26 '16 at 13:55
When you say you have unique ID numbers, are these GI or accession numbers? Additionally, I am assuming that you want to download the fasta files -- can you confirm if that is the case. — cer, May 26 '16 at 14:13
Thanks for the reply cer! Yes, I want fasta files. The ID number is called a "genome ID." For example, searching genome ID 10839 in the Genome database takes you to the carp genome. There are two other IDs associated with the carp genome on the Assembly database (310951 [UID] 1682368 [GenBank]). Ive searched the genome db multiple ways and get links and ID numbers, but not an accession number for genome sequence in the Nucleotide database. Does such an accession even exist? There are accession numbers for scaffolds within genomes (for example LN590719.1). Is that the best I can get? Thanks! — Melissa DeBiasse, May 26 '16 at 15:54
Melissa, I am not familiar with this genome ID. However, if you were amenable, you could do an NCBI (nucleotide database) search using terms that fit your criteria, and then download the resulting list of accession numbers. If you have those accession numbers (or GI numbers), there is a script that I can post that will download a fasta file for each number. Let me know if you think this approach will work for you. OH -- in what environment are you working (Mac, Linux, Windows)? — cer, May 26 '16 at 18:00
The other option is to include the search terms in the script so that you get something like this: http://stackoverflow.com/questions/37059616/using-search-terms-with-biopython-to-return-accession-numbers — cer, May 26 '16 at 18:06
Cer, what a good thought. I have a nice script that will return all the nucleotide sequences associated with a species name but this is more data than I need and a waste of computational resources. I think your idea to amend the search term to include the keyword 'complete genome' is a good option. I work on a Mac. — Melissa DeBiasse, May 26 '16 at 19:41

score 1 · Answer 1 · answered May 26 '16 at 20:26

Here is a script for you -- though you may need to tinker with it to make it work. Name the script whatever you prefer, but when you call the script do so as follows: python name_of_script[with .py extension] your_email_address.

You need to add your email to the end of the call else it will not work. If you have a text file of accession numbers (1/line), then choose option 2. If you choose option 1, it will ask you for items like the name of the organism, strain name, and keywords. Use as many keywords as you would like -- just be certain to separate them by commas. If you go with the first option, NCBI will be searched and will return GI numbers [NOTE: NCBI is phasing out the GI numbers in 9.2016 so this script may not work after this point] which will then be used to snag the accession numbers. Once all the accession numbers are present, a folder is created, and a subfolder is created for each accession number (named as the accession number). In each subfolder, the corresponding fasta AND genbank file will be downloaded. These files will carry the accession number as the file name (e.g. accession_number.fa, accession_number.gb). Edit script to your purposes.

ALSO...Please note the warning (ACHTUNG) portion of the script. Sometimes the rules can be bent...but if you are egregious enough, your IP may be blocked from NCBI. You have been warned.

import os
import os.path
import sys
import re #regular expressions
from Bio import Entrez
import datetime
import time
import glob

arguments = sys.argv
Entrez.email = arguments[1] #email

accession_ids = []
print('Select method for obtaining the accession numbers?\n')
action = input('1 -- Input Search Terms\n2 -- Use text file\n')

if action == '1':
    print('\nYou will be asked to enter an organism name, a strain name, and keywords.')
    print('It is not necessary to provide a value to each item (you may just hit [ENTER]), but you must provide at least one item.\n')

    organism = input('Enter the organism you wish to search for (e.g. Escherichia coli [ENTER])\n')
    strain = input('Enter the strain you wish to search for. (e.g., HUSEC2011 [ENTER])\n')
    keywords = input('Enter the keywords separated by a comma (e.g., complete genome, contigs, partial [ENTER])\n')
    search_phrase = ''

    if ',' in keywords:
        keywords = keywords.split(',')

    ncbi_terms = ['organism', 'strain', 'keyword']
    ncbi_values = [organism, strain, keywords]

    for index, n in enumerate(ncbi_values):
        if index == 0 and n != '':
            search_phrase = '(' + n + '[' + ncbi_terms[index] + '])'
        else:
            if n != '' and index != len(ncbi_values)-1:
                search_phrase = search_phrase + ' AND (' + n + '[' + ncbi_terms[index] + '])'
        if index == len(ncbi_values)-1 and n != '' and type(n) is not list:
            search_phrase = search_phrase + ' AND (' + n + '[' + ncbi_terms[index] + '])'
        if index == len(ncbi_values)-1 and n != '' and type(n) is list:
            for name in n:
                name = name.lstrip()
                search_phrase = search_phrase + ' AND (' + name + '[' + ncbi_terms[index] + '])'


    print('Here is the complete search line that will be used: \n\n', search_phrase)

    handle = Entrez.esearch(db='nuccore', term=search_phrase, retmax=1000, rettype='acc', retmode='text')
    result = Entrez.read(handle)
    handle.close()
    #print(result['Count'])
    gi_numbers = result['IdList']

    fetch_handle = Entrez.efetch(db='nucleotide', id=result['IdList'], rettype='acc', retmode='text')
    accession_ids = [id.strip() for id in fetch_handle]
    fetch_handle.close()

if action == '2': #use this option if you have a file of accession #s
    file_name = input('Enter the name of the file\n')

    with open(file_name, 'r') as input_file:
        lines = input_file.readlines()

        for line in lines:
            line = line.replace('\n', '')
            accession_ids.append(line)
#--------------------------------------------------------------------------------------------------------------

#----------------------------------- Make directory to store files --------------------------------------------
new_path = 'Genbank_Files/'
if not os.path.exists(new_path):
    os.makedirs(new_path)

print('You have ' + str(len(accession_ids)) + ' file(s) to download.') #print(accession_ids)

ending='.gb'

files = []
##CHECK IF FILE HAS BEEN DOWNLOADED
for dirpath, dirnames, filenames in os.walk(new_path):
    for filename in [f for f in filenames if f.endswith(ending)]: #for zipped files
        files.append(os.path.join(dirpath,filename))


for f in files:
    f = f.rsplit('/')[-1]
    f = f.replace('.gb', '')

    if f in accession_ids:
        ind = accession_ids.index(f)
        accession_ids.pop(ind)

print('')
print('You have ' + str(len(accession_ids)) + ' file(s) to download.')
#--------------------------------------------------------------------------

###############################################################################
#---ACHTUNG--ACHTUNG--ACHTUNG--ACHTUNG--ACHTUNG--ACHTUNG--ACHTUNG--ACHTUNG----#
###############################################################################
# Call Entrez to download files
# If downloading more than 100 files...
# Run this script only between 9pm-5am Monday - Friday EST
# Send E-utilities requests to http://eutils.ncbi.nlm.nih.gov
# Make no more than 3 requests every 1 second (Biopython takes care of this).
# Use URL parameter email & tool for distributed software
# NCBI's Disclaimer and Copyright notice must be evident to users of your service. 
#
# Use this script at your own risk. 
# Neither the script author nor author's employers are responsible for consequences arising from improper usage 
###############################################################################


# CALL ENTREZ: Call Entrez to download genbank AND fasta (nucleotide) files using accession numbers.
###############################################################################
start_day = datetime.date.today().weekday() # 0 is Monday, 6 is Sunday
start_time = datetime.datetime.now().time()
print(str(start_day), str(start_time))
print('')

if ((start_day < 5 and start_time > datetime.time(hour=21)) or (start_day < 5 and start_time < datetime.time(hour=5)) or start_day > 5 or len(accession_ids) <= 100 ):
    print('Calling Entrez...')

    for a in accession_ids:
        if ((datetime.date.today().weekday() < 5 and datetime.datetime.now().time() > datetime.time(hour=21)) or 
            (datetime.date.today().weekday() < 5 and datetime.datetime.now().time() < datetime.time(hour=5)) or 
            (datetime.date.today().weekday() == start_day + 1 and datetime.datetime.now().time() < datetime.time(hour=5)) or 
            (datetime.date.today().weekday() > 5) or len(accession_ids) <= 100 ):

            print('Downloading ' + a)

            new_path = 'Genbank_Files/' + a  + '/'
            if not os.path.exists(new_path):
                os.makedirs(new_path)

            handle=Entrez.efetch(db='nucleotide', id=a, rettype='gb', retmode='text', seq_start=0)
            FILENAME = new_path + a + '.gb'
            local_file=open(FILENAME,'w')
            local_file.write(handle.read())
            handle.close()
            local_file.close()

            handle=Entrez.efetch(db='nucleotide', id=a, rettype='fasta', retmode='text') 
            FILENAME =  new_path + a + '.fna'
            local_file=open(FILENAME,'w')
            local_file.write(handle.read())
            handle.close()
            local_file.close()

else:
    print('You have too many files to download at the time. Try again later.')
#-------

Alternative to Bio.Entrez EFetch for downloading full genome sequences from NCBI

1 Answers1