protein sequence coding

Question

I'm working on a python program to compute a numerical coding of mutated residues and positions of a set of strings (protein sequences), stored in fasta format file with each protein sequence is separated by comma. I'm trying to find the position and sequences which are mutated.

My fasta file is as follows:

MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN

Example:
The following figure (based on another set of fasta file) will explain the algorithm behind this. In this figure first box represents alignment of input file sequences. The last box represents the output file. How can I do this with my fasta file in Python?

example input file:

MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD


        positions  1  2  3  4  5  6                  1  2  3  4  5  6    
protein sequence1  M  T  A  Q  D  D                     T  A     D
protein sequence2  M  T  A  Q  D  D                     T  A     D    
protein sequence3  M  T  S  Q  E  D                     T  S     E    
protein sequence4  M  T  A  Q  D  D                     T  A     D    
protein sequence5  M  K  A  Q  H  D                     K  A     H

     PROTEIN SEQUENCE ALIGNMENT                   DISCARD NON-VARIABLE REGION    

        positions  2  2  3  3  5  5  5    
protein sequence1  T     A     D       
protein sequence2  T     A     D       
protein sequence3  T        S     E    
protein sequence4  T     A     D       
protein sequence5     K  A           H

MUTATED RESIDUE IS SPLITED TO SEPARATE COLUMN

Output file should be like this:

position+residue   2T  2K  3A  3S  5D  5E  5H    
       sequence1   1   0   1   0   1   0   0    
       sequence2   1   0   1   0   1   0   0    
       sequence3   1   0   0   1   0   1   0    
       sequence4   1   0   1   0   1   0   0    
       sequence5   0   1   1   0   0   0   1

    (RESIDUES ARE CODED 1 IF PRESENT, 0 IF ABSENT)

Here are two ways I have tried to do it:

ls= 'MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN'.split(',')
pos = [set(enumerate(x, 1)) for x in ls]
a=set().union(*pos)
alle = sorted(set().union(*pos))
print '\t'.join(str(x) + y for x, y in alle)
for p in pos:
    print '\t'.join('1' if key in p else '0' for key in alle)

(here I'm getting columns of mutated as well as non-mutated residues, but I want only columns for mutated residues)

from pandas import *
data = 'MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN'  
df = DataFrame([list(row) for row in data.split(',')])
df = DataFrame({str(col+1)+val:(df[col]==val).apply(int) for col in df.columns for val in set(df[col])})
print df.select(lambda x: not df[x].all(), axis = 1)

(here it is giving output ,but not in orderly ie, first 2K then 2T then 3A like that.)

How should I be doing this?

are you [sure](http://stackoverflow.com/questions/14498730/numerical-coding-of-mutated-residues-and-positions/14500175#14500175) or [this](http://stackoverflow.com/questions/14498730/numerical-coding-of-mutated-residues-and-positions) — root, Feb 01 '13 at 06:28

Andy Hayden · Accepted Answer · 2013-02-01T07:03:02.290

The function get_dummies gets you most of the way:

In [11]: s
Out[11]: 
0    T
1    T
2    T
3    T
4    K
Name: 1

In [12]: pd.get_dummies(s, prefix=s.name, prefix_sep='')
Out[12]: 
   1K  1T
0   0   1
1   0   1
2   0   1
3   0   1
4   1   0

And those columns which have differing values:

In [21]: (df.ix[0] != df).any()
Out[21]: 
0    False
1     True
2     True
3    False
4     True
5    False

Putting these together:

In [31]: I = df.columns[(df.ix[0] != df).any()]

In [32]: J = [pd.get_dummies(df[i], prefix=df[i].name, prefix_sep='') for i in I]

In [33]: df[[]].join(J)
Out[33]: 
   1K  1T  2A  2S  4D  4E  4H
0   0   1   1   0   1   0   0
1   0   1   1   0   1   0   0
2   0   1   0   1   0   1   0
3   0   1   1   0   1   0   0
4   1   0   1   0   0   0   1

Note: I created the initial DataFrame as follows, however this may be done more efficiently depending on your situation:

df = pd.DataFrame(map(list, 'MTAQDD,MTAQDD,MTSQED,MTAQDD,MKAQHD'.split(',')))

protein sequence coding

1 Answers1

Linked