I use Python 2.7 and the regex module. I use this expression to find a short sequence in a longer DNA sequence:
output = regex.findall(r'(?:'+probe+'){s<'+str(int(mismatches)+1)+'}', genome, regex.BESTMATCH)
The parameters are:
- probe: a short string I look for in the genome
- genome: a long string
- mismatches: how many substitutions I allow between the probe and the matching snippet from the genome.
Is there a way to get the positions of all the sequences in the genome that match the regex? And does this script find overlapping matches? It works pretty well, but then I decided to try, say:
probe = "TTGACAT"
genome = "TTGACATTGACATATAAT"
mismatches = 0
I got :
['TTGACAT']
With the same parameters but mismatches = 10
I got :
['TTGACAT','GACATAT']
So I do not know whether the script finds 'TTGACAT' only once because the second occurrence overlaps with the first, or whether it actually finds 'TTGACAT' twice and shows the result only once...
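To make clear what I expect, here is a minimal cross-check that does not use the regex module at all. It is only a sketch: it slides a window the length of the probe along the genome and counts substitutions, so it handles mismatches but not insertions or deletions. The function name hamming_matches is one I made up for illustration. It reports the start position of every hit, including overlapping ones.

```python
def hamming_matches(probe, genome, mismatches):
    """Return (start, snippet, n_mismatches) for every window of len(probe)
    in genome that differs from probe at no more than `mismatches` positions.

    Hypothetical helper for illustration; substitutions only, no indels.
    Every window is tested, so overlapping hits are all reported.
    """
    hits = []
    k = len(probe)
    for start in range(len(genome) - k + 1):
        window = genome[start:start + k]
        # Count positions where the probe and the window disagree.
        diff = sum(1 for a, b in zip(probe, window) if a != b)
        if diff <= mismatches:
            hits.append((start, window, diff))
    return hits


print(hamming_matches("TTGACAT", "TTGACATTGACATATAAT", 0))
# [(0, 'TTGACAT', 0), (6, 'TTGACAT', 0)]
```

So the probe really does occur twice, at positions 0 and 6, and the two occurrences share one 'T'. That suggests my findall call is skipping the overlapping occurrence rather than hiding a duplicate. I believe the regex module accepts an overlapped=True argument in findall and finditer, and that match objects from finditer give the position via start(), but I have not checked how that behaves together with fuzzy matching and BESTMATCH.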
Thanks