4

I am interested in identifying what feature (i.e. gene/cds) is at a particular location of a genome. For instance, what gene (if any) encompasses position 2,000,000. I know how to do this with a for loop and looping through each feature in the genome (code included below), but this is something I'd like to do hundreds of millions of times as part of a randomization study, and this will take much longer than I would like.

Code included below for more specific example of what I'm trying to do:

from Bio import SeqIO
import random
GenomeSeq = SeqIO.read(open("reference_sequence.gbk", "r"), "genbank")

interesting_position = random.randint(0, len(GenomeSeq))

for feature in GenomeSeq.features:  # loop each position through whole genome
    # In this particular case I'm interested in focusing on cds, but
    # in others, I may be interested in other feature types?
    if feature.type == "CDS":  
        if (feature.location._start.position <= interesting_position and 
            feature.location._end.position >= interesting_position):
            try:
                print feature.qualifiers['gene']
            except KeyError:
                print feature

I've thought about making a dictionary with each position within a gene corresponding to a key, and the feature ID as the value, as the lookup would be faster than the looping, but it just seems like there should be a way to do GenomeSeq[interestion_position].qualifiers['gene']

David Cain
  • 16,484
  • 14
  • 65
  • 75
ded
  • 420
  • 2
  • 13
  • Something like `GenomeSeq[interesting_position].features()`, perhaps? – verbsintransit Jul 25 '13 at 23:51
  • @verbsintransit yes that would be great, but it doesn't seem to work, i get an attribute error ('str' object has no attribute 'features'). Is this something that does work, or just something you would like to see work? – ded Jul 26 '13 at 12:46

2 Answers2

1

I've never used BioPython, but I found this located in its documentation:http://biopython.org/wiki/SeqIO

from Bio import SeqIO
handle = open("example.fasta", "rU")
records = list(SeqIO.parse(handle, "fasta"))
handle.close()
print records[0].id  #first record
print records[-1].id #last record

Is that what you're looking for?

sihrc
  • 2,728
  • 2
  • 22
  • 43
  • 1
    This accesses different sequence entries based on their order in the fasta file, not identifying specific genes based on a random posistion. – ded Jul 25 '13 at 21:14
1

Quite old, but I will give it a try. Lets suppose you want to check a random list of points for a given genome (and not a fixed set of points for a number of genomes):

from Bio import SeqIO
import random

GenomeSeq = SeqIO.read(open("sequence.gb", "r"), "genbank")

# Here I generate a lot of random points in the genome in a set
interesting_positions = set(random.sample(xrange(1, len(GenomeSeq)), 20000))


for feature in GenomeSeq.features:

    if feature.type == "CDS":
        # Here I create a set for every position covered by a feature
        feature_span = set(xrange(feature.location._start.position,
                                  feature.location._end.position))

        # And this is the "trick": check if the two sets overlaps: if any point
        # is in the interesting_positions AND in the points covered by this
        # feature, we are interested in the feature.
        if feature_span & interesting_positions:
            try:
                print feature.qualifiers['gene']
            except KeyError:
                print feature

For 20.000 points and a genome of 4.7Mb the loop takes about 3 seconds in my crapy-2003-computer, 5 seconds for 200.000 random points.


After profiling and a little refactoring, I extracted all the lines that only need to be calculated one time to get to this:

from Bio import SeqIO
import random


def sample_genome(feature_spans, interesting_positions):
  for span in feature_spans:  
      if span & interesting_positions:
          # Do your business here
          print span[1].qualifiers.get("gene") or span[1]

if __name__ == "__main__":
    GenomeSeq = SeqIO.read(open("sequence.gb", "r"), "genbank")
    genome_length = len(GenomeSeq)

    features = [f for f in GenomeSeq.features if f.type == "CDS"]

    spans = [(set(
        xrange(feature.location._start.position,
               feature.location._end.position)), feature)
        for feature in features]

    for i in range(20):
        positions = set(random.sample(xrange(1, genome_length), 200))
        sample_genome(spans, positions)

This consumes about 0.2 seconds per sample + 4 seconds reading/preparing the genome. About 50 hours to do the 1.000.000 samples in this old machine of mine. Being a random sampling, launch some processes in a multicored machine and you're done by tomorrow.

xbello
  • 7,223
  • 3
  • 28
  • 41
  • thanks for the response, this is actually something that i still have not solved. It seems this is effectively the same problem: having to step through the genome features 1 time per randomized set. 200 random points is reasonable, but idealy i'd like to repeat that at least 1 million times (1388 hours). – ded Aug 13 '14 at 13:12
  • @ded I added a fastest code to the answer. Let me know if it serves well. – xbello Aug 13 '14 at 16:13