Counting di-Amino Acid frequencies (Bigram frequencies) from FASTA files

Question

Given a large number of FASTA files (the peptidomes of secreted peptides for various organisms), how can I read the FASTA files (from UniProt) with Python (or MATLAB) and count the frequency of each amino acid and of amino-acid "double" pairings?

I.e., the output should have the percentage of each individual amino acid (out of the 22 letters/chars) AND the frequencies of pairings of amino acids.

Effectively, I want to count the bigram (or n-gram if easy to implement) frequencies for letter pairs.

The 22 amino acids are each represented by a unique letter in the FASTA file, and the name of each protein is preceded on its line by >. (I have already parsed the files, so only the relevant characters remain.)

Sample of a file:

FFKA

FLRN

MTTVSYVTILLTVLVQVLTSDAKATNNKRELSSGLKERSLSDDAPQFWKGRFSRSEEDPQ FWKGRFSDPQFWKGRFSDPQFWKGRFSDPQFWKGRFSDPQFWKGRFSDPQFWKGRFSDPQ FWKGRFSDGTKRENDPQYWKGRFSRSFEDQPDSEAQFWKGRFARTSSGEKREPQYWKGRF SRDSVPGRYGRELQGRFGRELQGRFGREAQGRFGRELQGRFGREFQGRFGREDQGRFGRE DQGRFGREDQGRFGREDQGRFGREDQGRFGREDQGRFGRELQGRFGREFQGRFGREDQGR FGREDQGRFGRELQGRFGREDQGRFGREDQGRFGREDLAKEDQGRFGREDLAKEDQGRFG REDIAEADQGRFGRNAAAAAAAAAAAKKRTIDVIDIESDPKPQTRFRDGKDMQEKRKVEK KDKIEKSDDALAKTS

Thank you very much!

This shouldn't be too bad using biopython, which I notice you've added as a tag. Can you post what you've done so far? (The [tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html) has several parsing examples.) — DSM, Aug 19 '12 at 21:59
It is helpful if you at least sketch what you have done or what specifically you are having trouble implementing. http://mattgemmell.com/2008/12/08/what-have-you-tried/ — El Developer, Aug 21 '12 at 03:26
I'm unable to add the biopython library/package to the computers I have access to (I do have NumPy, SciPy). — GrimSqueaker, Aug 21 '12 at 09:20
A chunk of a FASTA file, post parsing/editing looks like this: FFKA FLRN MTTVSYVTILLTVLVQVLTSDAKATNNKRELSSGLKERSLSDDAPQFWKGRFSRSEEDPQ FWKGRFSDPQFWKGRFSDPQFWKGRFSDPQFWKGRFSDPQFWKGRFSDPQFWKGRFSDPQ FWKGRFSDGTKRENDPQYWKGRFSRSFEDQPDSEAQFWKGRFARTSSGEKREPQYWKGRF SRDSVPGRYGRELQGRFGRELQGRFGREAQGRFGRELQGRFGREFQGRFGREDQGRFGRE DQGRFGREDQGRFGREDQGRFGREDQGRFGREDQGRFGRELQGRFGREFQGRFGREDQGR FGREDQGRFGRELQGRFGREDQGRFGREDQGRFGREDLAKEDQGRFGREDLAKEDQGRFG REDIAEADQGRFGRNAAAAAAAAAAAKKRTIDVIDIESDPKPQTRFRDGKDMQEKRKVEK KDKIEKSDDALAKTS — GrimSqueaker, Aug 21 '12 at 09:21
You should probably solve the problem of installing relevant packages first, and then use them to address the issue. Also, how many n-grams are there in an N-long sequence (N >> n)? ca. N-n or ca. N//n? These are two ways of counting I can think of. — Lev Levitsky, Aug 21 '12 at 09:32
In an N long sequence, there are N*(N-1) possible Bigrams. I don't know how to get started on finding and counting bigram frequencies (such as an initial library/dict) — GrimSqueaker, Aug 21 '12 at 12:02
N*(N-1)? So, 20 in a sequence of 5 aa's? That means that bigram is not two sequential residues, what is it then? Maybe you could [edit] the question to add a definition of what a bigram (or n-gram) actually is in a sequence. — Lev Levitsky, Aug 21 '12 at 12:33

score 3 · Accepted Answer · answered Sep 13 '12 at 01:17

How does this look?

>>> sequence = "LTSDAKAARFSDPQFWKGRFSDPQFWKGRSAAKGRFARTSSGAAEKREPQAAYWKGRF"
>>> occurrenceAA = sequence.count("AA")   # count non-overlapping occurrences of the dipeptide "AA"
>>> percent_occurrenceAA = occurrenceAA / len(sequence) * 100   # occurrences relative to sequence length, in percent
>>> print(occurrenceAA, "double-alanines in your sequence")
4 double-alanines in your sequence
>>> print(round(percent_occurrenceAA, 2), "% of total")   # rounding off % to 2 decimal places
6.9 % of total

Thanks! I ended up dropping the feature, but your code helped me get mine working for testing :) — GrimSqueaker, Feb 01 '14 at 11:27
