I got a list of around 5,000 genes as a search result from Gene Expression Atlas. From the results page I can download all the results in a file. That file contains a gene identifier (Ensembl Gene ID) for each gene. I now want the corresponding EMBL-Bank ID for each Ensembl Gene ID so that I can download their nucleotide sequences in batch from Dbfetch. Does anyone know how to achieve that? Can Biopython be used for this?
Have you made any attempts to solve these issues? Do you have any code to show? – David Cain Jun 17 '13 at 15:13
1 Answer
The file you can download is in a custom tab-delimited format, which none of Biopython's parsers are equipped to handle. Instead, you can use the csv module to extract what you'd like:
    import csv

    with open("listd1.tab") as tab_file:
        # Skip comment lines, which start with "#"
        data_lines = (line for line in tab_file if not line.startswith("#"))
        csv_data = csv.reader(data_lines, delimiter="\t")
        header = next(csv_data)  # ['Gene name', 'Gene identifier', ...]
        gene_id_index = header.index("Gene identifier")
        for line in csv_data:
            gene_id = line[gene_id_index]  # Do whatever you'd like with this
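
If you want the identifiers as a de-duplicated list (for example, to hand to an ID-mapping step later), the same idea can be wrapped in a small function. This is only a sketch: `read_gene_ids` is a hypothetical helper name, the column name "Gene identifier" is taken from the header shown in the comment above, and the sample rows are made up for illustration rather than copied from a real export.

```python
import csv


def read_gene_ids(lines, column="Gene identifier"):
    """Return the unique values of `column`, in first-seen order.

    `lines` is any iterable of text lines (an open file works).
    Assumes a tab-delimited layout with "#" comment lines before the header.
    """
    # Skip comment lines, which start with "#".
    data_lines = (line for line in lines if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    seen = set()
    gene_ids = []
    for row in reader:
        # Missing or empty cells are ignored rather than raising an error.
        gene_id = (row.get(column) or "").strip()
        if gene_id and gene_id not in seen:
            seen.add(gene_id)
            gene_ids.append(gene_id)
    return gene_ids


# Made-up sample in the assumed layout; a real export has more columns.
sample = [
    "# hypothetical comment line\n",
    "Gene name\tGene identifier\n",
    "BRCA2\tENSG00000139618\n",
    "TP53\tENSG00000141510\n",
    "BRCA2\tENSG00000139618\n",
]
print(read_gene_ids(sample))
```

With the real file you would pass the open file handle instead of the sample list, e.g. `with open("listd1.tab") as tab_file: ids = read_gene_ids(tab_file)`.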

David Cain