
I use Python 2.7 and the regex module. I use this expression to find a short sequence in a longer DNA sequence:

output = regex.findall(r'(?:'+probe+'){s<'+str(int(mismatches)+1)+'}', sequence, regex.BESTMATCH)

The parameters are:

  • probe: a short string I look for in the genome
  • sequence: a long string (the genome)
  • mismatches: how many differences I allow between the probe and the snippet from the genome.

Is there a way to get the positions of all the sequences in the genome that match the regex? Does this script find overlapping matches? It works pretty well, but then I decided to try, say:

probe = "TTGACAT"
sequence = "TTGACATTGACATATAAT"
mismatches = 0

I got :

['TTGACAT']

With the same parameters but mismatches = 10

I got :

['TTGACAT','GACATAT']

So I do not know whether the script finds 'TTGACAT' only once because the first occurrence overlaps with the second, or whether it actually finds 'TTGACAT' twice and shows the result only once...

Thanks

WhyOhWhy
  • Are you sure you don't need something like http://en.wikipedia.org/wiki/Sequence_alignment ? – Dr.Kameleon Mar 01 '14 at 18:16
  • Hey. Thank you for your answer. I agree that the sequence alignment approach would be more efficient. I am still learning how to use the BioPython library, especially the BLAST functionalities, and needed an "emergency script". For now, using a regex is enough. Thanks anyway :) – WhyOhWhy Mar 03 '14 at 08:20

1 Answer

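It finds 'TTGACAT' only once because the second occurrence overlaps with the first (they share one T), and by default findall only returns non-overlapping matches.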

If you want all overlapping results, use the same pattern with the overlapped=True argument:

output = regex.findall(r'(?:'+probe+'){s<'+str(int(mismatches)+1)+'}', sequence, regex.BESTMATCH, overlapped=True)

If you also want the position of each match in the sequence:

for m in regex.finditer(r'(?:'+probe+'){s<'+str(int(mismatches)+1)+'}', sequence, regex.BESTMATCH, overlapped=True):
    print '%d: %s' % (m.start(), m.group())
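
As a cross-check that does not depend on the regex module, the same question (substitutions only, no gaps, overlapping hits allowed) can be sketched with a plain sliding window. This is not the method used above, only an illustration: the function name find_with_mismatches is made up for this example, and unlike BESTMATCH it reports every window within the mismatch limit rather than preferring the best one.

```python
def find_with_mismatches(probe, sequence, max_mismatches):
    """Return (start, window) for every window of len(probe) in sequence
    that differs from probe by at most max_mismatches substitutions.
    Hypothetical helper for illustration; works on Python 2.7 and 3."""
    hits = []
    width = len(probe)
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Count positions where the probe and the window disagree.
        differences = sum(1 for a, b in zip(probe, window) if a != b)
        if differences <= max_mismatches:
            hits.append((start, window))
    return hits


# The example from the question: both overlapping occurrences are reported.
hits = find_with_mismatches("TTGACAT", "TTGACATTGACATATAAT", 0)
# hits == [(0, 'TTGACAT'), (6, 'TTGACAT')]
```

The two hits start at positions 0 and 6 and share the T at index 6, which is why the non-overlapping search reports only the first one.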

As an aside: a limitation of overlapping results

If I use these three parameters:

probe = "ACTG.*ACTG"
sequence = "ACTGTTGACATTGAACTGCATATAATACTG"
mismatches = 0

I will find only two results, ['ACTGTTGACATTGAACTGCATATAATACTG', 'ACTGCATATAATACTG'], instead of three, because two results cannot start at the same position in the string: the shorter match 'ACTGTTGACATTGAACTG' also starts at position 0 and is therefore not reported.
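
If all three spans are really needed, one possible workaround (a sketch only, for the exact-match case mismatches = 0, using the standard re module rather than regex) is to locate every occurrence of the flanking motif and then pair them up by hand. The helper name all_flanked_spans is hypothetical.

```python
import re


def all_flanked_spans(sequence, flank):
    """Return (start, substring) for every span that begins and ends with
    a non-overlapping pair of exact occurrences of flank.
    Hypothetical helper for illustration only."""
    # A zero-width lookahead finds every start position, overlapping or not.
    starts = [m.start()
              for m in re.finditer('(?=' + re.escape(flank) + ')', sequence)]
    spans = []
    for i, left in enumerate(starts):
        for right in starts[i + 1:]:
            # The closing occurrence must begin after the opening one ends.
            if right >= left + len(flank):
                spans.append((left, sequence[left:right + len(flank)]))
    return spans


spans = all_flanked_spans("ACTGTTGACATTGAACTGCATATAATACTG", "ACTG")
# spans contains three entries: two starting at 0 and one starting at 14.
```

Here two of the spans start at position 0, which is exactly the case a single overlapped search cannot report.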

Casimir et Hippolyte
  • It works fine, thanks! int(mismatches) is because the parameters are given by the user through a GUI that returns only str() type. – WhyOhWhy Mar 01 '14 at 17:07
  • Do you have by any chance an idea on how I can retrieve the positions of the results? Thanks – WhyOhWhy Mar 01 '14 at 17:14
  • Np. It works fine for what I plan to use this code snippet for. I started using regular expressions only a week ago, so I could not find a way to combine the mismatch thing with finditer either. Thanks for your help! :) – WhyOhWhy Mar 01 '14 at 17:46
  • Thanks for your ACTG.*ACTG observation. It is interesting and I'll keep it in mind for a later piece of software I intend to write. However, I am using the previous code snippet to test PCR primers and probes. Since I allow mismatches but no gaps, this programming subtlety is solved by the "biological" conditions of the problem :) – WhyOhWhy Mar 03 '14 at 08:24