Inputting and aligning protein sequences

Question

I have a script that finds mutated positions in protein sequences:

import pandas as pd #data analysis python module
data =     'MTAQDDSYSDGKGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYLGAVFQLN,MTSQEDSYSDGKGNYNTIMPGAVFQLN,MTAQDDSYSDGRGDYNTIMPGAVFQLN,MKAQDDSYSDGRGNYNTIYLGAVFQLQ,MKSQEDSYSDGRGDYNTIYLGAVFQLN,MTAQDDSYSDGRGDYNTIYPGAVFQLN,MTAQEDSYSDGRGEYNTIYLGAVFQLQ,MTAQDDSYSDGKGDYNTIMLGAVFQLN,MTAQDDSYSDGRGEYNTIYLGAVFQLN' #protein sequences

df = pd.DataFrame(map(list,data.split(',')))

I = df.columns[(df.ix[0] != df).any()] 

J = [pd.get_dummies(df[i], prefix=df[i].name+1, prefix_sep='') for i in I] 

print df[[]].join(J)

Here the input protein sequences are hard coded. In a real application the user has to supply the input sequences, i.e. the input should be soft coded. Also, no alignment is done here. I read the Biopython tutorial and found the following script, but I don't know how to combine it with the one above.

from Bio import AlignIO
alignment = AlignIO.read("c:\python27\proj\data1.fasta", "fasta")
print alignment

How can I do this? What I have tried:

>>> import sys

>>> import pandas as pd

>>> from Bio import AlignIO

>>> data=sys.stdin.read()
    MTAQDDSYSDGKGDYNTIYLGAVFQLN
    MTAQDDSYSDGRGDYNTIYLGAVFQLN
    MTSQEDSYSDGKGNYNTIMPGAVFQLN
    MTAQDDSYSDGRGDYNTIMPGAVFQLN
    MKAQDDSYSDGRGNYNTIYLGAVFQLQ
    MKSQEDSYSDGRGDYNTIYLGAVFQLN
    MTAQDDSYSDGRGDYNTIYPGAVFQLN
    MTAQEDSYSDGRGEYNTIYLGAVFQLQ
    MTAQDDSYSDGKGDYNTIMLGAVFQLN
    MTAQDDSYSDGRGEYNTIYLGAVFQLN
    ^Z
>>> df=pd.DataFrame(map(list,data.split(',')))
>>> I=df.columns[(df.ix[0]!=df).any()]
>>> J=[pd.get_dummies(df[i],prefix=df[i].name+1,prefix_sep='')for i in I]
>>> print df[[]].join(J)

But it gives an empty DataFrame as output.

I also tried the following, but I don't know how to load these sequences into my script:

while 1:
 var=raw_input("Enter your sequence here:")
 print "you entered ",var

Please help me.

for the third code snippet: make sure that data is comma separated and not space or new line separated, or change `data.split(',')` to e.g `data.split('\n')` — Francesco Montesano, Feb 07 '13 at 09:41
I recognise [this code](http://stackoverflow.com/questions/14639825/protein-sequence-coding/14641252#14641252) (!) — Andy Hayden, Feb 07 '13 at 16:56

score 1 · Answer 1 · edited May 23 '17 at 12:22

When you read in data via:

sys.stdin.read()

the sequences are separated by '\n' rather than ',' (printing data would confirm whether this is the case; it may be system dependent), so you should split on that instead:

df = pd.DataFrame(map(list,data.split('\n')))
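A plain `split('\n')` can still leave some debris: input that ends with a newline yields a final empty string, and pasted lines may carry leading spaces, as they appear to in the session above. Below is a minimal sketch of a more forgiving splitter. The helper name `parse_sequences` is hypothetical and not part of the original script, and accepting commas as well as line breaks is an assumption made for convenience.

```python
import re


def parse_sequences(text):
    """Split pasted text into sequences (hypothetical helper, not from the original script)."""
    # Split on commas and on any kind of line ending ('\n' or '\r\n').
    pieces = re.split(r"[,\r\n]+", text)
    # Trim surrounding whitespace and drop the empty pieces that
    # blank lines or a trailing newline would otherwise leave behind.
    return [piece.strip() for piece in pieces if piece.strip()]


# Intended use with interactive input (not executed here):
#   import sys
#   sequences = parse_sequences(sys.stdin.read())
```

The resulting list can be passed to `pd.DataFrame(map(list, ...))` in place of `data.split('\n')`.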

A good way to check this kind of thing is to step through the code line by line. You would then see that df is a one-row DataFrame: its only row is compared with itself, so no column differs, which makes I empty and in turn produces the empty output.
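To make that check concrete, here is a sketch of the same pipeline wrapped in a function, so the one-row case and the correctly split case can be compared side by side. It assumes a recent pandas, where `.ix` has been removed, so `.iloc` is used instead. The name `mutation_table` is hypothetical, and the short sequences in the usage note are toy data rather than the sequences from the question.

```python
import pandas as pd


def mutation_table(sequences):
    """One-hot encode every position that differs from the first sequence.

    Hypothetical helper; assumes a pandas version without `.ix`.
    """
    # One row per sequence, one column per residue position.
    df = pd.DataFrame([list(seq) for seq in sequences])
    # Columns in which at least one row differs from the first row.
    variable = df.columns[df.ne(df.iloc[0], axis="columns").any()]
    if len(variable) == 0:
        # Nothing varies (for example, only one row): no columns to report.
        return pd.DataFrame(index=df.index)
    # Column labels are the 1-based position followed by the residue, e.g. "2K".
    dummies = [
        pd.get_dummies(df[i], prefix=str(i + 1), prefix_sep="")
        for i in variable
    ]
    return pd.concat(dummies, axis=1)
```

Calling `mutation_table(["MTAQ\nMKAQ"])`, which is what splitting newline-separated input on ',' amounts to, returns a frame with no columns, while `mutation_table(["MTAQ", "MKAQ", "MTSQ"])` returns one indicator column per residue observed at each variable position.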

Aside: what a well written piece of code you are using! :)
