I got a list of around 5,000 genes as a search result from Gene Expression Atlas. From the results page I can download all the results as a file, which contains a gene identifier (Ensembl Gene ID) for each gene. I now want the corresponding EMBL-Bank ID for each Ensembl Gene ID so that I can download the nucleotide sequences in batch from Dbfetch. Does anyone know how to achieve that? Can Biopython be used for this?

user1144004

1 Answer

The file you can download is in a custom tab-delimited format (which none of Biopython's parsers are equipped to handle).

Instead, you can just use the csv module to extract what you'd like:

import csv

# Open the tab-delimited export downloaded from the Atlas results page
with open("listd1.tab") as tab_file:
    # Skip the comment lines that start with "#"
    data_lines = (line for line in tab_file if not line.startswith("#"))
    csv_data = csv.reader(data_lines, delimiter="\t")
    header = next(csv_data)  # ['Gene name', 'Gene identifier', ...]
    gene_id_index = header.index("Gene identifier")

    for line in csv_data:
        gene_id = line[gene_id_index]  # Do whatever you'd like with this
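
If you want the identifiers collected in one place for a later batch lookup, the same idea can be wrapped in a small function. This is only a sketch: the function name `extract_gene_ids`, the de-duplication step, and the sample rows are assumptions for illustration, not part of the Atlas export format.

```python
import csv
import io


def extract_gene_ids(tab_file, column="Gene identifier"):
    """Return the unique values of `column`, in first-seen order.

    `tab_file` is any iterable of text lines (an open file works).
    The default column name is taken from the header shown above.
    """
    # Drop comment lines that start with "#"
    data_lines = (line for line in tab_file if not line.startswith("#"))
    reader = csv.reader(data_lines, delimiter="\t")
    header = next(reader)
    column_index = header.index(column)  # raises ValueError if absent

    seen = set()
    gene_ids = []
    for row in reader:
        if len(row) <= column_index:
            continue  # skip blank or truncated rows
        gene_id = row[column_index].strip()
        if gene_id and gene_id not in seen:
            seen.add(gene_id)
            gene_ids.append(gene_id)
    return gene_ids


# Hypothetical sample standing in for the downloaded file
sample = io.StringIO(
    "# Atlas export (made-up sample)\n"
    "Gene name\tGene identifier\n"
    "BRCA2\tENSG00000139618\n"
    "TP53\tENSG00000141510\n"
    "BRCA2\tENSG00000139618\n"
)
gene_ids = extract_gene_ids(sample)
```

With a real download you would pass the open file instead of the sample, for example `with open("listd1.tab", newline="") as tab_file: gene_ids = extract_gene_ids(tab_file)`. Mapping the resulting Ensembl Gene IDs to EMBL-Bank IDs is a separate step that this sketch does not cover.
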
David Cain