5

How can I fetch genomic sequence efficiently using Python? For example, from a .fa file or some other easily obtained format? I basically want an interface fetch_seq(chrom, strand, start, end) which will return the sequence [start, end] on the given chromosome on the specified strand.

Analogously, is there a programmatic python interface for getting phastCons scores?

thanks.

4 Answers4

4

Retrieving sequence data from large human chromosome files can be inefficient memory-wise, so if you're looking for computational efficiency you can format the sequence data to a packed binary string and lookup based on byte location. I wrote routines to do this in perl (available here ), and python has the same pack and unpack routines - so it can be done, but only worth it if you're running in to trouble with large files on a limited machine. Otherwise use biopython SeqIO

RussD
  • 41
  • 3
2

See my answer to your question over at Biostar:

http://biostar.stackexchange.com/questions/1639/getting-genomic-sequences-and-phastcons-scores-using-python-from-ensembl-ucsc

Use SeqIO with Fasta files and you'll get back record objects for each item in the file. Then you can do:

region = rec.seq[start:end]

to pull out slices. The nice thing about using a standard library is you don't have to worry about the line breaks in the original fasta file.

Brad Chapman
  • 395
  • 1
  • 7
  • 1
    I agree that this approach is really elegant because you have a standard library to use but I found it to be very slow. If you assume a fasta file with no newlines, you can simply "seek" the coordinates in the file which I think is much faster, and it doesn't require you to load all the fasta files from every chromosome into memory. Is there a way to achieve the same kind of efficiency with a standard library like biopython? thanks. –  Jul 08 '10 at 02:12
  • It's not exactly clear what you are looking for, but I agree that a custom solution tailored to your specific files will be faster than a more general solution. In practice most FASTA files have line breaks and what not, so I prefer being general but your experience may vary. – Brad Chapman Jul 08 '10 at 14:08
  • I know this is an old thread, but I'm hoping someone else will notice and bring up more recent news. Samtools (samtools.sourceforge.net) has a function called faidx that allows you to do exactly this on the command line. Pysam (https://github.com/pysam-developers/pysam) wraps many of the samtools methods, but I can't seem to get the faidx function to work. :P At worst, I can wrap the commandline tool, but I'd like something "native" if possible. – Gordon Bean Oct 09 '14 at 23:57
1

Take a look at biopython, which has support for several gene sequence formats. Specifically, it has support for FASTA and GenBank files, to name a couple.

slacy
  • 11,397
  • 8
  • 56
  • 61
  • It does, but I can see it only supports reading of records from FASTA, not fetching of sequences... if you wanted to fetch a sequence (start, end) from FASTA, you'd need a format with no new lines and the right interface and I don't think BioPython supports that. Maybe I missed something though - could you point to the relevant doc on this? Thank you! –  Jul 07 '10 at 05:05
0

pyfasta is the module you're looking for. From the description

fast, memory-efficient, pythonic (and command-line) access to fasta sequence files

https://github.com/brentp/pyfasta

Viktiglemma
  • 912
  • 8
  • 19