Retrieve EMBL-Bank ID through corresponding Ensembl Gene ID in batch

Question

I got a list of around 5000 genes as a search result from Gene Expression Atlas. From the results page I can download all the results as a file, which contains a gene identifier (Ensembl Gene ID) for each gene. I now want the corresponding EMBL-Bank ID for each Ensembl Gene ID so that I can download their nucleotide sequences in batch from Dbfetch. Does anyone know how to achieve that? Can Biopython be used for this?

Have you made any attempts to solve these issues? Do you have any code to show? — David Cain, Jun 17 '13 at 15:13

score 0 · Answer 1 · answered Jun 17 '13 at 15:12

The file you can download is in a custom tab-delimited format (which none of Biopython's parsers are equipped to handle).

Instead, you can just use the csv module to extract what you'd like:

import csv


with open("listd1.tab", newline="") as tab_file:
    # Skip the comment lines that precede the header row
    data_lines = (line for line in tab_file if not line.startswith("#"))
    csv_data = csv.reader(data_lines, delimiter="\t")
    header = next(csv_data)  # ['Gene name', 'Gene identifier', ...]
    gene_id_index = header.index("Gene identifier")

    for line in csv_data:
        gene_id = line[gene_id_index]  # Do whatever you'd like with this
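
If the goal is to feed the identifiers to a batch service afterwards, it may help to collect them into a de-duplicated list and split that list into chunks. The following is only a rough sketch: the function names extract_gene_ids and batches, the sample rows, and the batch size are all hypothetical, and it does not perform the Ensembl Gene ID to EMBL-Bank ID mapping itself.

```python
import csv


def extract_gene_ids(lines, column="Gene identifier"):
    """Return unique values of `column` from tab-delimited lines, in file order.

    `lines` is any iterable of text lines (an open file works); lines
    starting with '#' are treated as comments and skipped.
    """
    data_lines = (line for line in lines if not line.startswith("#"))
    reader = csv.reader(data_lines, delimiter="\t")
    header = next(reader)
    index = header.index(column)

    seen = set()
    gene_ids = []
    for row in reader:
        if len(row) <= index:  # blank or truncated row
            continue
        gene_id = row[index].strip()
        if gene_id and gene_id not in seen:
            seen.add(gene_id)
            gene_ids.append(gene_id)
    return gene_ids


def batches(items, size):
    """Yield consecutive slices of `items`, each with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Hypothetical sample rows standing in for the downloaded file.
sample = [
    "# hypothetical comment line\n",
    "Gene name\tGene identifier\n",
    "GENE_A\tENSG00000000001\n",
    "GENE_B\tENSG00000000002\n",
    "GENE_A\tENSG00000000001\n",
]
gene_ids = extract_gene_ids(sample)
```

With a real download, you would pass the open file object instead of the sample list, then loop over batches(gene_ids, n) with whatever n the downstream service accepts.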
