Fetching genomic sequence efficiently in Python?

Question

How can I fetch genomic sequence efficiently using Python? For example, from a .fa file or some other easily obtained format? I basically want an interface fetch_seq(chrom, strand, start, end) which will return the sequence [start, end] on the given chromosome on the specified strand.

Analogously, is there a programmatic python interface for getting phastCons scores?

thanks.

score 4 · Answer 1 · answered Dec 04 '12 at 01:45

Retrieving sequence data from large human chromosome files can be inefficient memory-wise, so if you're looking for computational efficiency you can format the sequence data to a packed binary string and lookup based on byte location. I wrote routines to do this in perl (available here ), and python has the same pack and unpack routines - so it can be done, but only worth it if you're running in to trouble with large files on a limited machine. Otherwise use biopython SeqIO

score 2 · Accepted Answer · answered Jul 07 '10 at 11:27

2

See my answer to your question over at Biostar:

http://biostar.stackexchange.com/questions/1639/getting-genomic-sequences-and-phastcons-scores-using-python-from-ensembl-ucsc

Use SeqIO with Fasta files and you'll get back record objects for each item in the file. Then you can do:

region = rec.seq[start:end]

to pull out slices. The nice thing about using a standard library is you don't have to worry about the line breaks in the original fasta file.

answered Jul 07 '10 at 11:27

Brad Chapman

395
1
7

1

I agree that this approach is really elegant because you have a standard library to use but I found it to be very slow. If you assume a fasta file with no newlines, you can simply "seek" the coordinates in the file which I think is much faster, and it doesn't require you to load all the fasta files from every chromosome into memory. Is there a way to achieve the same kind of efficiency with a standard library like biopython? thanks. – Jul 08 '10 at 02:12
It's not exactly clear what you are looking for, but I agree that a custom solution tailored to your specific files will be faster than a more general solution. In practice most FASTA files have line breaks and what not, so I prefer being general but your experience may vary. – Brad Chapman Jul 08 '10 at 14:08
I know this is an old thread, but I'm hoping someone else will notice and bring up more recent news. Samtools (samtools.sourceforge.net) has a function called faidx that allows you to do exactly this on the command line. Pysam (https://github.com/pysam-developers/pysam) wraps many of the samtools methods, but I can't seem to get the faidx function to work. :P At worst, I can wrap the commandline tool, but I'd like something "native" if possible. – Gordon Bean Oct 09 '14 at 23:57

score 1 · Answer 3 · answered Jul 07 '10 at 04:11

1

Take a look at biopython, which has support for several gene sequence formats. Specifically, it has support for FASTA and GenBank files, to name a couple.

answered Jul 07 '10 at 04:11

slacy

11,397
8
56
61

It does, but I can see it only supports reading of records from FASTA, not fetching of sequences... if you wanted to fetch a sequence (start, end) from FASTA, you'd need a format with no new lines and the right interface and I don't think BioPython supports that. Maybe I missed something though - could you point to the relevant doc on this? Thank you! – Jul 07 '10 at 05:05

score 0 · Answer 4 · answered Mar 18 '11 at 08:13

0

pyfasta is the module you're looking for. From the description

fast, memory-efficient, pythonic (and command-line) access to fasta sequence files

https://github.com/brentp/pyfasta

answered Mar 18 '11 at 08:13

Viktiglemma

912
8
19

Fetching genomic sequence efficiently in Python?

4 Answers4