
Given a large number of FASTA files (the peptidomes of secreted peptides for various organisms), how can I read the files (from UniProt) with Python (or MATLAB) and count the frequency of each amino acid and of each amino-acid pair?

I.e., the output should give the percentage of each individual amino acid (out of the 22 letters) and the frequencies of amino-acid pairings.

Effectively, I want to count the bigram (or n-gram if easy to implement) frequencies for letter pairs.

The 22 amino acids are each represented by a unique letter in the FASTA file, and the name of each protein is preceded on its line by >. (I have already parsed the files, so only the relevant characters remain.)

Sample of a file:

FFKA

FLRN

MTTVSYVTILLTVLVQVLTSDAKATNNKRELSSGLKERSLSDDAPQFWKGRFSRSEEDPQ FWKGRFSDPQFWKGRFSDPQFWKGRFSDPQFWKGRFSDPQFWKGRFSDPQFWKGRFSDPQ FWKGRFSDGTKRENDPQYWKGRFSRSFEDQPDSEAQFWKGRFARTSSGEKREPQYWKGRF SRDSVPGRYGRELQGRFGRELQGRFGREAQGRFGRELQGRFGREFQGRFGREDQGRFGRE DQGRFGREDQGRFGREDQGRFGREDQGRFGREDQGRFGRELQGRFGREFQGRFGREDQGR FGREDQGRFGRELQGRFGREDQGRFGREDQGRFGREDLAKEDQGRFGREDLAKEDQGRFG REDIAEADQGRFGRNAAAAAAAAAAAKKRTIDVIDIESDPKPQTRFRDGKDMQEKRKVEK KDKIEKSDDALAKTS

Thank you very much!

GrimSqueaker
  • This shouldn't be too bad using biopython, which I notice you've added as a tag. Can you post what you've done so far? (The [tutorial](http://biopython.org/DIST/docs/tutorial/Tutorial.html) has several parsing examples.) – DSM Aug 19 '12 at 21:59
  • It is helpful if you at least sketch what you have done or what specifically you are having trouble implementing. http://mattgemmell.com/2008/12/08/what-have-you-tried/ – El Developer Aug 21 '12 at 03:26
  • I'm unable to add the biopython library/package to the computers I have access to (I do have NumPy and SciPy). – GrimSqueaker Aug 21 '12 at 09:20
  • A chunk of a FASTA file, post parsing/editing looks like this: FFKA FLRN MTTVSYVTILLTVLVQVLTSDAKATNNKRELSSGLKERSLSDDAPQFWKGRFSRSEEDPQ FWKGRFSDPQFWKGRFSDPQFWKGRFSDPQFWKGRFSDPQFWKGRFSDPQFWKGRFSDPQ FWKGRFSDGTKRENDPQYWKGRFSRSFEDQPDSEAQFWKGRFARTSSGEKREPQYWKGRF SRDSVPGRYGRELQGRFGRELQGRFGREAQGRFGRELQGRFGREFQGRFGREDQGRFGRE DQGRFGREDQGRFGREDQGRFGREDQGRFGREDQGRFGRELQGRFGREFQGRFGREDQGR FGREDQGRFGRELQGRFGREDQGRFGREDQGRFGREDLAKEDQGRFGREDLAKEDQGRFG REDIAEADQGRFGRNAAAAAAAAAAAKKRTIDVIDIESDPKPQTRFRDGKDMQEKRKVEK KDKIEKSDDALAKTS – GrimSqueaker Aug 21 '12 at 09:21
  • You should probably solve the problem of installing relevant packages first, and then use them to address the issue. Also, how many n-grams are there in an N-long sequence (N >> n)? ca. N-n or ca. N//n? These are two ways of counting I can think of. – Lev Levitsky Aug 21 '12 at 09:32
  • In an N long sequence, there are N*(N-1) possible Bigrams. I don't know how to get started on finding and counting bigram frequencies (such as an initial library/dict) – GrimSqueaker Aug 21 '12 at 12:02
  • N*(N-1)? So, 20 in a sequence of 5 aa's? That means that bigram is not two sequential residues, what is it then? Maybe you could [edit] the question to add a definition of what a bigram (or n-gram) actually is in a sequence. – Lev Levitsky Aug 21 '12 at 12:33

1 Answer


How does this look?

>>> sequence = "LTSDAKAARFSDPQFWKGRFSDPQFWKGRSAAKGRFARTSSGAAEKREPQAAYWKGRF"
>>> occurrenceAA = sequence.count("AA")   # non-overlapping occurrences of the pair "AA" ("AAA" counts once)
>>> percent_occurrenceAA = occurrenceAA / len(sequence) * 100   # count relative to the sequence length
>>> print(occurrenceAA, "double-alanines in your sequence")
4 double-alanines in your sequence
>>> print(round(percent_occurrenceAA, 2), "% of total")   # rounding off % to 2 decimal places
6.9 % of total
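
Counting one pair at a time with `str.count` gets tedious for all pairs, and it skips overlapping matches. A minimal sketch that tallies every residue and every overlapping pair in one pass, using only the standard library and assuming the sequences are already cleaned strings as in the question. The names `ngram_counts`, `to_percent`, and the `sequences` list are illustrative, not from the original post. A sequence of length N yields N - n + 1 overlapping n-grams, so `n=3` and up work the same way.

```python
from collections import Counter


def ngram_counts(sequence, n=1):
    """Count overlapping n-grams in one sequence (n=1: residues, n=2: pairs)."""
    # Drop spaces/newlines left over from line-wrapped records.
    seq = "".join(sequence.split()).upper()
    # A sequence of length N has N - n + 1 overlapping windows of size n.
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def to_percent(counts):
    """Convert raw counts to percentages of all counted n-grams."""
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: 100.0 * c / total for gram, c in counts.items()}


# Hypothetical input: one already-cleaned peptide sequence per list entry.
sequences = ["FFKA", "FLRN", "LTSDAKAARFSDPQFWKGRF"]

residues = Counter()
pairs = Counter()
for seq in sequences:
    # Count per sequence so that no pair spans two different peptides.
    residues.update(ngram_counts(seq, 1))
    pairs.update(ngram_counts(seq, 2))

residue_percent = to_percent(residues)
pair_percent = to_percent(pairs)

# Show the five most frequent pairs with their share of all pairs.
for gram, pct in sorted(pair_percent.items(), key=lambda kv: -kv[1])[:5]:
    print(gram, round(pct, 2))
```

Percentages here are relative to the total number of n-grams counted, not to the sequence length; divide by something else if a different normalization is wanted.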
chimpsarehungry