I am aware that solutions to the Rosalind challenges are available, but I do not want them to spoil the fun. I thought I had found a solution for "Finding a Shared Motif", yet my answer is rejected every time.
The problem is to find the longest common substring(s) of the sequences in a given file. Each record starts with a line beginning with ">", and the lines that follow, up to the next line beginning with ">", make up one sequence. Here is what it looks like:

    >Rosalind_1
    GATTACA
    >Rosalind_2
    TAGACCA
    >Rosalind_3
    ATACA

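To make the format concrete, here is a minimal sketch of how such a file could be read into a list of sequences. It is an illustration rather than the script I actually ran: `parse_fasta` is a hypothetical helper name, and it assumes that a sequence may span several lines, so each line is stripped of its trailing newline before the pieces are joined.

```python
def parse_fasta(text):
    """Return the sequences found in FASTA-formatted text, in file order."""
    sequences = []
    current = []  # pieces of the sequence currently being read
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A header closes the previous record, if there was one.
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line)
    if current:  # the last record has no header after it
        sequences.append("".join(current))
    return sequences


sample = ">Rosalind_1\nGATTACA\n>Rosalind_2\nTAGACCA\n>Rosalind_3\nATACA\n"
print(parse_fasta(sample))  # ['GATTACA', 'TAGACCA', 'ATACA']
```

Joining stripped lines means a sequence that is wrapped over several lines comes out as one uninterrupted string.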
There are about a hundred DNA sequences, and the task is to find the longest common substring. Here is my approach:

    rosa = open("rosalind_lcsm.txt", "r")
    oku = rosa.readlines()
    strs = []
    for line in oku:
        if line.startswith(">"):
            strs.append("kiko")
        else:
            strs.append(line)
    rosa.close()
    strs = strs[1:]
    joint = "".join(strs)
    joint_s = joint.split("kiko")
    theOne = joint_s[0]
    rest = joint_s[1:]
    start = 0
    end = 1
    matches = []
    while end < len(theOne):
        end += 1
        while all(theOne[start:end] in seq for seq in rest):
            end += 1
        else:
            matches.append(theOne[start:end-1])
            end += 1
            start = end - 1
    print(max(matches, key=len))

My strategy is: read the file, split it into sequences, take the first sequence, and look for the parts it shares with all the others. I only consider matches of at least 2 characters, because the sequences consist of A, T, G and C, so a 1-character match is certain to occur. The search starts at one position and extends the candidate substring by 1 character at a time until it no longer occurs in every sequence. It then appends the last matching substring to a list and restarts from where it stopped.
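For comparison, this is a brute-force sketch of the result I expect, which could serve as a reference on small inputs. It is not the greedy scan described above but a different, slower technique: it tries every substring of the shortest sequence, longest first, and returns the first one found in all the others. The name `longest_common_substring` is hypothetical, and the input is assumed to be a list of sequences that are already free of newlines.

```python
def longest_common_substring(seqs):
    """Brute force: test substrings of the shortest sequence, longest first."""
    if not seqs:
        return ""
    # No common substring can be longer than the shortest sequence.
    shortest = min(seqs, key=len)
    others = [s for s in seqs if s is not shortest]
    for length in range(len(shortest), 0, -1):
        for start in range(len(shortest) - length + 1):
            candidate = shortest[start:start + length]
            if all(candidate in s for s in others):
                return candidate  # first hit at this length is a longest one
    return ""


print(longest_common_substring(["GATTACA", "TAGACCA", "ATACA"]))  # TA
```

On the sample data this prints `TA`; `AC` and `CA` are equally long common substrings, so any of the three should be acceptable. Because every start position is tried for every length, no candidate is skipped, at the cost of far more comparisons than a single left-to-right scan.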
My code produces an answer, but it is not the right one, and I cannot spot the mistake. Could someone look at my approach and advise me on how to fix it?