I'm working on the Open Reading Frames exercise on Rosalind, and my results differ from the sample output given with the task. The exercise description can be found here.
I have this code:
gencode = {"GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
           "TGT": "C", "TGC": "C",
           "GAT": "D", "GAC": "D",
           "GAA": "E", "GAG": "E",
           "TTT": "F", "TTC": "F",
           "GGT": "G", "GGC": "G", "GGA": "G", "GGG": "G",
           "CAT": "H", "CAC": "H",
           "ATA": "I", "ATT": "I", "ATC": "I",
           "AAA": "K", "AAG": "K",
           "TTA": "L", "TTG": "L", "CTT": "L", "CTC": "L", "CTA": "L", "CTG": "L",
           "ATG": "M",
           "AAT": "N", "AAC": "N",
           "CCT": "P", "CCC": "P", "CCA": "P", "CCG": "P",
           "CAA": "Q", "CAG": "Q",
           "CGT": "R", "CGC": "R", "CGA": "R", "CGG": "R", "AGA": "R", "AGG": "R",
           "TCT": "S", "TCC": "S", "TCA": "S", "TCG": "S", "AGT": "S", "AGC": "S",
           "ACT": "T", "ACC": "T", "ACA": "T", "ACG": "T",
           "GTT": "V", "GTC": "V", "GTA": "V", "GTG": "V",
           "TGG": "W",
           "TAT": "Y", "TAC": "Y",
           "TAA": "_", "TAG": "_", "TGA": "_"}
seq = 'AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAGCCTGAATGATCCGAGTAGCATCTCAG'
rev_seq = seq[::-1]
def get_orf_proteins(seq):
    proteins=[]
    for i in range(len(seq)-2):
        if gencode[seq[i:i+3]] == 'M':
            print(i)
            prot = ''
            k = i
            while gencode[seq[k:k+3]] != '_' and k < len(seq)-3:
                prot += gencode[seq[k:k+3]]
                k += 3
            proteins.append(prot)
    return(list(set(proteins)))
print(get_orf_proteins(seq))
print(get_orf_proteins(rev_seq))
This returns the following protein sequences (the first list is for seq, the second for rev_seq):
['MGMTPRLGLESLLE', 'MTPRLGLESLLE', 'M', 'MIRVAS']
['MY', 'MSLVSPNKVFSEIRFSAPVGVHWTQSMY']
Am I missing something, or is the example solution incorrect?
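For comparison, here is a minimal sketch of two things that might account for the difference. It assumes the exercise wants the reverse complement of the strand (the code above only reverses the string with `seq[::-1]`) and that a reading frame counts only if it actually reaches a stop codon (the `while` loop above also keeps a protein when it runs off the end of the sequence). The names `reverse_complement` and `find_orf_proteins` are made up for illustration; this is not the reference solution.

```python
from itertools import product

# Standard genetic code in TCAG order; "_" marks a stop codon,
# matching the convention of the gencode table in the question.
BASES = "TCAG"
AMINO = ("FFLLSSSSYY__CC_W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Swap each base for its complement, then reverse the strand."""
    return dna.translate(COMPLEMENT)[::-1]


def find_orf_proteins(dna):
    """Collect proteins from frames that start at ATG and reach a stop codon.

    Hypothetical helper: both the strand and its reverse complement are scanned.
    """
    proteins = set()
    for strand in (dna, reverse_complement(dna)):
        for start in range(len(strand) - 2):
            if strand[start:start + 3] != "ATG":
                continue
            protein = []
            # Step through complete codons only, so every lookup is valid.
            for k in range(start, len(strand) - 2, 3):
                amino_acid = CODON_TABLE[strand[k:k + 3]]
                if amino_acid == "_":
                    # Keep the protein only once a stop codon closes the frame.
                    proteins.add("".join(protein))
                    break
                protein.append(amino_acid)
    return proteins


seq = ("AGCCATGTAGCTAACTCAGGTTACATGGGGATGACCCCGCGACTTGGATTAGAGTCTCTTTTGGAATAAG"
       "CCTGAATGATCCGAGTAGCATCTCAG")
for protein in sorted(find_orf_proteins(seq)):
    print(protein)
```

Under these assumptions, MIRVAS is no longer reported for the sequence above, because that frame reaches the end of the string without a stop codon. The reverse strand contributes MLLGSFRLIPKETLIQVAGSSPCNLS instead of the two proteins obtained from the plain reversed string.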