
I know that solutions to the Rosalind challenges are available, but I do not want them to spoil the fun. I thought I had found a solution for "Finding a Shared Motif", yet my answer is wrong every time.

The task is to find the longest common substring(s) of the sequences in a given file. Each record starts with a header line beginning with ">", and the lines that follow, up to the next line beginning with ">", make up one sequence. Here is what it looks like:

    >Rosalind_1
    GATTACA
    >Rosalind_2
    TAGACCA
    >Rosalind_3
    ATACA

There are about a hundred DNA sequences, and you have to find their longest common substring. Here is my approach:

    rosa = open("rosalind_lcsm.txt","r")
    oku = rosa.readlines()
    strs=[]
    for line in oku:
        if line.startswith(">"):
            strs.append("kiko")
        else:
            strs.append(line)
    rosa.close()
    strs = strs[1:]
    joint = "".join(strs)
    joint_s = joint.split("kiko")

    theOne = joint_s[0]
    rest = joint_s[1:]

    start=0
    end=1
    matches=[]

    while end < len(theOne):
        end+=1
        while all(theOne[start:end] in seq for seq in rest):
            end+=1
        else:
            matches.append(theOne[start:end-1])
            end+=1
        start=end-1
    print(max(matches, key=len))

My strategy was: read the file, split it into sequences, pick the first sequence, and look for the parts it shares with all the others. I start with candidates of at least 2 characters because the sequences are made of A, T, G and C, so a 1-character match will definitely occur. The search starts at a character and keeps extending the candidate by 1 character until the match breaks. Then it appends the last matching piece to a list and restarts from where it stopped.

My solution produces an answer, but it is not the right one, and I cannot spot the mistake in the code. Can someone try to understand my approach and give me advice on fixing it?

Fırat Uyulur

1 Answer


I don't speak Python, but I think you're skipping possible matches by doing start=end-1. You probably need to do start=start+1 instead.

For example, let's say you have these strings:

    GATCAA
    GAGCAATCAA

Your algorithm will first find GA as a common substring and then continue looking from beyond that match. That way you miss the real longest common substring, ATCAA, which starts inside it.
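
If you want to check that claim independently of your own loop, here is a minimal sketch using `difflib.SequenceMatcher` from the Python standard library. The variable names are only for illustration, and this is a sanity check for two strings, not a solution to the full problem.

```python
from difflib import SequenceMatcher

a = "GATCAA"
b = "GAGCAATCAA"

# autojunk=False switches off the popularity heuristic used for long inputs,
# so every character is considered.
matcher = SequenceMatcher(None, a, b, autojunk=False)

# Longest block that occurs in both a[0:len(a)] and b[0:len(b)].
match = matcher.find_longest_match(0, len(a), 0, len(b))
longest = a[match.a:match.a + match.size]

print(longest)  # ATCAA
```

The result is longer than GA and begins at the second character of the first string, which is exactly the position a search that jumps past GA never tries as a starting point.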

EDIT: Obviously you also need to reinitialize end whenever you move start. Either set it to start+1, so that you always start looking from a two-letter string as you do now, or optimize your code by starting from the length of the longest match you've found so far.
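
A minimal sketch of how that could look, written as a guess at the intent rather than a tested solution to the challenge. The function names `read_fasta` and `longest_shared_motif` are made up for this example. Two assumptions are built in: a sequence may span several lines, so line breaks are stripped while reading (`readlines()` keeps the trailing newline on each line), and returning any one of the longest matches is acceptable.

```python
def read_fasta(path):
    """Return the sequences of a FASTA file, with line breaks removed."""
    sequences = []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                sequences.append("")   # header line: start a new record
            elif sequences:
                sequences[-1] += line  # a sequence may span several lines
    return sequences


def longest_shared_motif(sequences):
    """Return one longest substring found in every sequence ('' if none)."""
    first, rest = sequences[0], sequences[1:]
    best = ""
    # Advance start by one each time, so no starting position is skipped.
    for start in range(len(first)):
        # Only candidates longer than the best match so far are worth testing.
        end = start + len(best) + 1
        # The bound on end matters: slicing past the end of a string silently
        # returns a shorter slice, which would otherwise keep matching forever.
        while end <= len(first) and all(first[start:end] in seq for seq in rest):
            best = first[start:end]
            end += 1
    return best


print(longest_shared_motif(["GATCAA", "GAGCAATCAA"]))  # ATCAA
```

Starting each candidate at one character more than the best match so far is safe because every prefix of a common substring is itself common: if that candidate fails, nothing longer from the same start can succeed. With the file from the question, the call would be `longest_shared_motif(read_fasta("rosalind_lcsm.txt"))`.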

GertG