Getting a gene sequence from Entrez using Biopython

Question

I have a list of gene names, for example [ITGB1, RELA, NFKBIA], and I want to fetch the sequence for each of them.

After looking through the Biopython help and the tutorial for the Entrez API, I came up with this:

from Bio import Entrez

x = ['ITGB1', 'RELA', 'NFKBIA']
for item in x:
    handle = Entrez.efetch(db="nucleotide", id=item, rettype="gb")
    record = handle.read()
    out_handle = open('genes/'+item+'.xml', 'w')  # to create a file with gene name
    out_handle.write(record)
    out_handle.close()

But this keeps erroring out. I have discovered that it works if the id is a numerical ID (although you have to pass it as a string, e.g. '186972394'), like so:

handle = Entrez.efetch(db="nucleotide", id='186972394' ,rettype="gb")

This gets me the info I want, which includes the sequence.

So now to the question: how can I search by gene name (since I do not have ID numbers), or easily convert my gene names to IDs, so that I can get the sequences for my gene list?

Thank you,

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc101 Like this? `handle = Entrez.esearch(db="nucleotide",term="Cypripedioideae[Orgn] AND matK[Gene]")` — Calvin Cheng, Nov 26 '12 at 02:55
Kind of. I am using handle = Entrez.esearch(db="nucleotide",term="Homo[orgn] AND RELA[gene]") as an example, since I want the Homo sapiens gene RELA, but this returns a list of hits. The first one happens to be what I want, but my gene list has about 100 genes. How can I make sure that I am getting the right ID for every gene using the method you pointed out? I will just be given lists of IDs. — StudentOfScience, Nov 26 '12 at 03:11

score 5 · Accepted Answer · answered Dec 02 '12 at 04:33

First, start with the gene name, e.g. ATK1:

item = 'ATK1'
animal = 'Homo sapiens'
search_string = item+"[Gene] AND "+animal+"[Organism] AND mRNA[Filter] AND RefSeq[Filter]"

Now we have a search string to search for IDs:

from Bio import Entrez

handle = Entrez.esearch(db="nucleotide", term=search_string)
record = Entrez.read(handle)
ids = record['IdList']

This returns the IDs as a list; if no ID is found, the list is empty ([]). Now let's assume it returns one item in the list.

seq_id = ids[0]  # you must implement an if to deal with the 0 or >1 cases
handle = Entrez.efetch(db="nucleotide", id=seq_id, rettype="fasta", retmode="text")
record = handle.read()

This will give you a FASTA string, which you can save to a file:

with open('myfasta.fasta', 'w') as out_handle:  # the file is closed automatically
    out_handle.write(record.rstrip('\n'))
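Putting the steps above together for a whole gene list could look roughly like the sketch below. The function names (`build_term`, `pick_single_id`, `fetch_fasta`) are hypothetical and not part of Biopython; only `esearch`, `read` and `efetch` are the Bio.Entrez calls already used in this answer. The Entrez module is passed in as an argument so the helpers can be tried out without network access. Raising an error on zero or several hits is just one possible policy, and since several RefSeq records may match one gene name, you may prefer to inspect the hits instead.

```python
def build_term(gene, organism="Homo sapiens"):
    """Build the esearch term from the answer above for one gene."""
    return f"{gene}[Gene] AND {organism}[Organism] AND mRNA[Filter] AND RefSeq[Filter]"


def pick_single_id(ids, gene):
    """Return the only hit, or raise if the search was empty or ambiguous."""
    if not ids:
        raise LookupError(f"no record found for {gene}")
    if len(ids) > 1:
        raise LookupError(f"{len(ids)} records matched {gene}: {', '.join(ids)}")
    return ids[0]


def fetch_fasta(entrez, gene, organism="Homo sapiens"):
    """Search for one gene and return the FASTA text of its single hit.

    `entrez` is assumed to be the Bio.Entrez module (or anything that
    offers the same esearch/read/efetch functions).
    """
    search = entrez.esearch(db="nucleotide", term=build_term(gene, organism))
    try:
        ids = entrez.read(search)["IdList"]
    finally:
        search.close()
    seq_id = pick_single_id(ids, gene)
    fetch = entrez.efetch(db="nucleotide", id=seq_id, rettype="fasta", retmode="text")
    try:
        return fetch.read()
    finally:
        fetch.close()


# Usage (needs Biopython and network access):
#   from Bio import Entrez
#   Entrez.email = "you@example.org"  # placeholder address, use your own
#   for gene in ["ITGB1", "RELA", "NFKBIA"]:
#       with open(gene + ".fasta", "w") as out:
#           out.write(fetch_fasta(Entrez, gene))
```

Running one search per gene keeps the link between each gene name and its ID, which a single combined search would not give you directly.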

Is it 'handleA' or 'handle'? — dexterdev, Dec 24 '14 at 07:07

RocketDonkey · Answer 2 · 2012-11-26T03:03:59.457

0

Looking at section 8.3 of the tutorial, there appears to be a function that will allow you to search for terms and get the corresponding IDs (I know nothing about this library and even less about biology, so this will potentially be completely wrong :) ).

>>> handle = Entrez.esearch(db="nucleotide",term="Cypripedioideae[Orgn] AND matK[Gene]")
>>> record = Entrez.read(handle)
>>> record["Count"]
'25'
>>> record["IdList"]
['126789333', '37222967', '37222966', '37222965', ..., '61585492']

From what I can tell, id refers to an actual ID number as returned by the esearch function (in the IdList attribute of the response). However if you use the term keyword, you can instead run a search and get the IDs of the matched items. Totally untested, but assuming the search supports boolean operators (it looks like AND works), you could try using a query like:

>>> handle = Entrez.esearch(db="nucleotide",term="ITGB1[Gene] OR RELA[Gene] OR NFKBIA[Gene]")
>>> record = Entrez.read(handle)
>>> record["IdList"]
# Hopefully your ids here...

To generate the term to insert, you could do something like this:

In [1]: l = ['ITGB1', 'RELA', 'NFKBIA']

In [2]: ' OR '.join('%s[Gene]' % i for i in l)
Out[2]: 'ITGB1[Gene] OR RELA[Gene] OR NFKBIA[Gene]'
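If the combined query should also be limited to one species, the same join can be wrapped in a small helper. This is only a sketch: `build_or_term` is a hypothetical name, and the `[Organism]` field tag is borrowed from the accepted answer rather than tested here. Keep in mind that one combined search returns a flat list of IDs, so it does not by itself tell you which ID belongs to which gene.

```python
def build_or_term(genes, organism=None):
    """Join gene names into one Entrez query, optionally limited to an organism."""
    term = " OR ".join(f"{gene}[Gene]" for gene in genes)
    if organism:
        # Parentheses keep the OR group together before the AND is applied.
        term = f"({term}) AND {organism}[Organism]"
    return term


# Usage:
#   build_or_term(['ITGB1', 'RELA', 'NFKBIA'], organism="Homo sapiens")
```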

The record["IdList"] could then be converted into a comma-delimited string and passed to the id argument in your original query by using something like:

In [3]: r = ['1234', '5678', '91011']

In [4]: ids = ','.join(r)

In [5]: ids
Out[5]: '1234,5678,91011'
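For a longer ID list, one option is to send the IDs to efetch in batches rather than as one very long string. The helper below is a sketch; `id_batches` is a hypothetical name and the batch size of 200 is an arbitrary assumption, not a limit taken from this thread. Also note that esearch returns only the first 20 IDs unless you pass a larger `retmax`.

```python
def id_batches(ids, size=200):
    """Yield comma-joined chunks of at most `size` IDs for efetch's id argument."""
    for start in range(0, len(ids), size):
        yield ",".join(ids[start:start + size])


# Usage (needs Biopython and network access):
#   for id_string in id_batches(record["IdList"]):
#       handle = Entrez.efetch(db="nucleotide", id=id_string,
#                              rettype="fasta", retmode="text")
```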

edited Nov 26 '12 at 03:03

answered Nov 26 '12 at 02:58

I appreciate this; however, as you predicted, it does work, but it is not quite what I want. Here is why: I am using handle = Entrez.esearch(db="nucleotide",term="Homo[orgn] AND RELA[gene]") as an example, since I want the Homo sapiens gene RELA, but this returns a list of hits. The first one happens to be what I want, but my gene list has about 100 genes. How can I make sure that I am getting the right ID for every gene using the method you pointed out? I will just be given lists of IDs. – StudentOfScience Nov 26 '12 at 03:15
@StudentOfScience Hmm, so my understanding of this is that it is returning a list of results, correct? Is there a way in the standard search to get the exact information for a gene? Or is a search always involved? Assuming you knew that the first result is the one you want, you could just take the first element of the list `my_list[0]`. If that isn't precise enough, I assume you could do something like pull the results and then compare some portion of the output to see if it matched your gene, but that doesn't sound like the optimum solution. – RocketDonkey Nov 26 '12 at 03:24
Yes, that is what I do not know, and what I was asking: whether there is a way to give the species and gene name directly and get an ID to use later to fetch the FASTA, or to get the sequence file directly (FASTA or whatever the output format is). – StudentOfScience Nov 26 '12 at 03:30
@StudentOfScience I'll take a look around, but my best (and completely unfounded) guess is that if the actual database/retrieval system returns a list of results when a text field is entered, that functionality will be duplicated using the Python API. If you know of a way to do what you suggest using the actual database (I'm not even sure if this is the right terminology), then I'm sure we can find a way to simulate the behavior via the API. – RocketDonkey Nov 26 '12 at 03:36
