0

I have written the following python code to solve one of the Rosalind problems (http://rosalind.info/problems/cons/) and for some reason, Rosalind says the answer is wrong but I did some spot-checking and it appears right.

The problem is as follows:

Given: A collection of at most 10 DNA strings of equal length (at most 1 kbp) in FASTA format.

Return: A consensus string and profile matrix for the collection. (If several possible consensus strings exist, then you may return any one of them.)

A sample dataset is:

>Rosalind_1
ATCCAGCT
>Rosalind_2
GGGCAACT
>Rosalind_3
ATGGATCT
>Rosalind_4
AAGCAACC
>Rosalind_5
TTGGAACT
>Rosalind_6
ATGCCATT
>Rosalind_7
ATGGCACT

A sample solution is:

ATGCAACT
A: 5 1 0 0 5 5 0 0
C: 0 0 1 4 2 0 6 1
G: 1 1 6 3 0 1 0 0
T: 1 5 0 0 0 1 1 6

My attempt to solve this:

from Bio import SeqIO

A,C,G,T = [],[],[],[]
consensus=""

for i in range(0,len(record.seq)):
    countA,countC,countG,countT=0,0,0,0
    for record in SeqIO.parse("fasta.txt", "fasta"):
        if record.seq[i]=="A":
            countA=countA+1
        if record.seq[i]=="C":
            countC=countC+1
        if record.seq[i]=="G":
            countG=countG+1
        if record.seq[i]=="T":
            countT=countT+1

    A.append(countA)
    C.append(countC)
    G.append(countG)
    T.append(countT)

    if countA >= max(countC,countG,countT):
        consensus=consensus+"A"
    elif countC >= max(countA,countG,countT):
        consensus=consensus+"C"
    elif countG >= max(countA,countC,countT):
        consensus=consensus+"G"
    elif countT >= max(countA,countC,countG):
        consensus=consensus+"T"

print("A: "+" ".join([str(i) for i in A]))
print("C: "+" ".join([str(i) for i in C]))
print("G: "+" ".join([str(i) for i in G]))
print("T: "+" ".join([str(i) for i in T]))

print(consensus)

Would be great if someone can take a look and suggest what I am doing wrong? Many thanks!

Radwa Sharaf
  • 159
  • 14

2 Answers2

1

For your consensus string, your code is not handling the case in which you have a tie, i.e., two nucleotides in a given position are equally frequent. The way your code is written now, this case will result in nothing being printed at that position in the consensus string

Niema Moshiri
  • 909
  • 5
  • 14
  • Thanks Niema for the suggestion. Excellent point. I edited the count if statements to be > or equal, but I am afraid that didn't solve it either. They say: "If several possible consensus strings exist, then you may return any one of them" so I would have thought iterating over the bases as such would be okay. – Radwa Sharaf Nov 21 '17 at 23:20
  • That is still incorrect. If you simply change > to >= in your current code, you will output *all* nucleotides that tied for most frequent at that position. You should output just 1 of the nucleotides that tied for most frequent at that position – Niema Moshiri Nov 22 '17 at 00:15
  • Just to give you a hint in the right direction: the `continue` keyword of Python can make your code correct after the > to >= update – Niema Moshiri Nov 22 '17 at 00:25
  • Thanks for the suggestion Niema. I am not sure what you mean since I have an "elif" statement, which shouldn't execute if the previous statement was TRUE. For example: x=4 y=3 if x<10: print("yes") elif y<30: print("also yes") The output of that is only "yes". – Radwa Sharaf Nov 22 '17 at 18:58
  • Update, the code as is worked and gave me the correct answer in Rosalind :) I am not sure what you meant by the "continue" – Radwa Sharaf Nov 22 '17 at 20:35
0

in this part

 if countA >= max(countC,countG,countT):
            consensus=consensus+"A"
        elif countC >= max(countA,countG,countT):
            consensus=consensus+"C"
        elif countG >= max(countA,countC,countT):
            consensus=consensus+"G"
        elif countT >= max(countA,countC,countG):
            consensus=consensus+"T"

Use this instead and you will get your Consensus sequences correctly

if countA[i] >= max(countC[i],countG[i],countT[i]):
      consensus+="A"
  if countC[i] >= max(countA[i],countG[i],countT[i]):
      consensus+="C"     
  if countG[i] >= max(countA[i],countC[i],countT[i]):
      consensus+="G"
  if countT[i] >= max(countA[i],countC[i],countG[i]):
      consensus+="T"
Midhat
  • 1