1

I'm trying to get the positions of matches found with findall from the new regex module, so that overlapping matches are included. I can find the matches, but I can't work out their locations.

My code:

import regex as re
seq = "ATCCAAGGAGTTTGCAGAGGTGGCGTTTGCAGCATGAGAT"
substring="GTTTGCAG"
xx=re.findall(substring,seq,overlapped=True)
print xx

xx would look like

['GTTTGCAG', 'GTTTGCAG']

because there are two matches, at positions 10-17 and 25-32 (counting from 1).

How can I obtain these numbers? Checking dir(xx) shows no start/end/pos that I could use with this function. (I tried xx.index(substring), but that only seems to give an index into the resulting list, e.g. 0 and 1 in this case.)

Thank you.

Helene

3 Answers

3

This iterates over every start index in the sequence, takes the slice whose length equals the length of the pattern, and compares it with the pattern. When the two are equal, it records the start and end index in the string. It is a simple list comprehension.

sequence = "ATCCAAGGAGTTTGCAGAGGTGGCGTTTGCAGCATGAGAT"
substring = "GTTTGCAG"

def find_indexes(seq, sub):
    # Compare the slice starting at each index i against sub; end index is exclusive.
    return [(sub, i, len(sub)+i) for i in range(len(seq)) if seq[i:len(sub)+i] == sub]

print find_indexes(sequence, substring)

Out:

[('GTTTGCAG', 9, 17), ('GTTTGCAG', 24, 32)]
  • An explanation might be useful to people reading this. Otherwise it could be a piece of random code – kdopen Mar 11 '15 at 23:26
  • This is a concise non-regex method, I think. And I always enjoy pythony list comprehension solutions! :) – Omega Mar 12 '15 at 01:44
2

Using re.finditer, you can obtain start locations:

import re
seq = "blahblahblahLALALAblahblahLALA"
substring="LALA"
lenss=len(substring)
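# re.escape makes the substring literal; a bare backslash only escapes its first character
overlapsearch="(?=("+re.escape(substring)+"))"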
xx=[[x.start(),x.start()+lenss] for x in list(re.finditer(overlapsearch,seq))]
check=[seq[x[0]:x[1]] for x in xx]
print xx
print check

Results:

[[12, 16], [14, 18], [26, 30]]
['LALA', 'LALA', 'LALA']

And results using your original example:

[[9, 17], [24, 32]]
['GTTTGCAG', 'GTTTGCAG']

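Wrapping the pattern in "(?=...)" makes it a lookahead: the regex checks that the substring follows at the current position without consuming any characters, so the next match can begin one character later and reuse characters from the previous match.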
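Here is a minimal sketch of the same lookahead idea for Python 3 (the snippets in this answer use Python 2 print statements). It assumes the substring should be matched literally, hence re.escape, and the function name overlapping_spans is only illustrative. Instead of adding the length by hand, it reads the span of the capture group inside the lookahead:

```python
import re

def overlapping_spans(seq, sub):
    """Return (start, end) for every occurrence of sub in seq, overlaps included."""
    # The lookahead matches at zero width, so the scan advances one
    # character at a time; group 1 captures the actual substring.
    pattern = re.compile("(?=(" + re.escape(sub) + "))")
    # span(1) is the half-open (start, end) range of the captured group.
    return [m.span(1) for m in pattern.finditer(seq)]

print(overlapping_spans("blahblahblahLALALAblahblahLALA", "LALA"))
# [(12, 16), (14, 18), (26, 30)]
```

The end index is exclusive, so seq[start:end] returns the matched text.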

Omega
0

If you're not using regular expressions, you can repeatedly call str.find() with its optional start argument, moving the start one past each hit so that overlapping matches are still found.

For example:

sequence = "ATCCAAGGAGTTTGCAGAGGTGGCGTTTGCAGCATGAGAT"
substring="GTTTGCAG"

def find_endpoints(seq, sub):
    off = 0
    matches = []
    while True:
        idx = seq.find(sub, off)  # search for the sub argument, not the global substring
        if idx == -1: break
        matches.append((idx, idx+len(sub)))
        off = idx + 1
    return matches

for (s,e) in find_endpoints(sequence, substring):
    print(s, e, sequence[s:e])

Outputs:

(9, 17, 'GTTTGCAG')
(24, 32, 'GTTTGCAG')

Note: (s,e) are the start index (inclusive) and end index (exclusive) of the substring.
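To relate these 0-based, half-open pairs to the 1-based positions quoted in the question (10-17 and 25-32), only the start needs shifting. A small sketch in Python 3; the helper name to_one_based is made up for illustration:

```python
def to_one_based(span):
    """Convert a 0-based half-open (start, end) pair to 1-based inclusive positions."""
    start, end = span
    # The exclusive 0-based end is the same number as the inclusive 1-based end.
    return (start + 1, end)

print([to_one_based(s) for s in [(9, 17), (24, 32)]])
# [(10, 17), (25, 32)]
```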

jedwards