I am interested in identifying what feature (i.e. gene/cds) is at a particular location of a genome. For instance, what gene (if any) encompasses position 2,000,000. I know how to do this with a for
loop and looping through each feature in the genome (code included below), but this is something I'd like to do hundreds of millions of times as part of a randomization study, and this will take much longer than I would like.
Code included below for more specific example of what I'm trying to do:
from Bio import SeqIO
import random
GenomeSeq = SeqIO.read(open("reference_sequence.gbk", "r"), "genbank")
interesting_position = random.randint(0, len(GenomeSeq))
for feature in GenomeSeq.features: # loop each position through whole genome
# In this particular case I'm interested in focusing on cds, but
# in others, I may be interested in other feature types?
if feature.type == "CDS":
if (feature.location._start.position <= interesting_position and
feature.location._end.position >= interesting_position):
try:
print feature.qualifiers['gene']
except KeyError:
print feature
I've thought about making a dictionary with each position within a gene corresponding to a key, and the feature ID as the value, as the lookup would be faster than the looping, but it just seems like there should be a way to do GenomeSeq[interestion_position].qualifiers['gene']