Match and index all substrings, including overlapping ones

Question

I'm using findall from the new regex module with overlapped=True so that overlapping matches are included. I can get the matched strings, but I can't get their locations in the sequence.

My code:

import regex as re
seq = "ATCCAAGGAGTTTGCAGAGGTGGCGTTTGCAGCATGAGAT"
substring="GTTTGCAG"
xx=re.findall(substring,seq,overlapped=True)
print xx

xx would look like

['GTTTGCAG', 'GTTTGCAG']

because there are two matches, at positions 10-17 and 25-32 (counting from 1).

How can I obtain these positions? Checking dir(xx) shows no start/end/pos that I could use with this new function. (I tried xx.index(substring), but that only gives an index into the resulting list, e.g. 0 or 1 in this case, not a position in seq.)

Thank you.

score 3 · Answer 1 · 2015-03-12T00:07:24.457


This iterates over every position in the sequence, takes the slice whose length equals the length of the pattern, and compares it with the pattern. If they are the same, it records the start and end index in the string. It is a simple list comprehension.

sequence = "ATCCAAGGAGTTTGCAGAGGTGGCGTTTGCAGCATGAGAT"
substring = "GTTTGCAG"

def find_indexes(seq, sub):
    # Compare the slice starting at each position with sub; keep (sub, start, end) on a match
    return [(sub, i, i + len(sub)) for i in range(len(seq)) if seq[i:i + len(sub)] == sub]

print find_indexes(sequence, substring)

Out:

[('GTTTGCAG', 9, 17), ('GTTTGCAG', 24, 32)]
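
Because every start position is tested independently, the slice comparison also reports overlapping hits. The following is only a sketch of that point, not the answerer's code: find_indexes_bounded is a hypothetical name, and stopping the loop at the last position where sub still fits is an assumption added here to skip slices that are too short to match.

```python
def find_indexes_bounded(seq, sub):
    # Hypothetical variant: only visit start positions where sub can still fit.
    last_start = len(seq) - len(sub)
    return [(sub, i, i + len(sub))
            for i in range(last_start + 1)
            if seq[i:i + len(sub)] == sub]

# "LALALA" contains "LALA" twice, sharing the middle "LA".
hits = find_indexes_bounded("blahLALALAblah", "LALA")
print(hits)  # [('LALA', 4, 8), ('LALA', 6, 10)]
```

If sub is longer than seq, the range is empty and the function returns an empty list.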

edited Mar 12 '15 at 00:07

answered Mar 11 '15 at 23:21

An explanation might be useful to people reading this. Otherwise it could be a piece of random code – kdopen Mar 11 '15 at 23:26
This is a concise non-regex method, I think. And I always enjoy pythony list comprehension solutions! :) – Omega Mar 12 '15 at 01:44

Omega · Accepted Answer · 2015-03-11T23:50:00.680

2

Using re.finditer, you can obtain start locations:

import re
seq = "blahblahblahLALALAblahblahLALA"
substring="LALA"
lenss=len(substring)
# Escape the whole substring and wrap it in a lookahead so that matches may overlap
overlapsearch="(?=("+re.escape(substring)+"))"
xx=[[x.start(),x.start()+lenss] for x in re.finditer(overlapsearch,seq)]
check=[seq[x[0]:x[1]] for x in xx]
print xx
print check

Results:

[[12, 16], [14, 18], [26, 30]]
['LALA', 'LALA', 'LALA']

And results using your original example:

[[9, 17], [24, 32]]
['GTTTGCAG', 'GTTTGCAG']

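Wrapping the substring in a lookahead, "(?=...)", makes each match zero-width: the regex engine checks that the substring starts at the current position but consumes no characters, so the next match can use characters from the previous match.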
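
The lookahead idea can be sketched as a small helper using only the standard-library re module. The name overlapping_spans is hypothetical, and the half-open (start, end) return format is an assumption chosen to match the results above.

```python
import re

def overlapping_spans(seq, sub):
    # re.escape keeps any regex metacharacters in sub literal.
    pattern = re.compile("(?=" + re.escape(sub) + ")")
    # Each lookahead match is empty, so m.start() is where sub begins;
    # the end index is computed from the length of sub.
    return [(m.start(), m.start() + len(sub)) for m in pattern.finditer(seq)]

spans = overlapping_spans("blahblahblahLALALAblahblahLALA", "LALA")
print(spans)  # [(12, 16), (14, 18), (26, 30)]
```

Since the match itself is empty, m.end() would equal m.start() here, which is why the end is derived from len(sub) rather than read from the match object.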

edited Mar 11 '15 at 23:50

answered Mar 11 '15 at 23:12

Hackishly fixed overlapping issue, I think ;) – Omega Mar 11 '15 at 23:38

score 0 · Answer 3 · answered Mar 11 '15 at 23:12

If you're not using regular expressions, you can just repeatedly call str.find() with the optional start argument.

For example:

sequence = "ATCCAAGGAGTTTGCAGAGGTGGCGTTTGCAGCATGAGAT"
substring="GTTTGCAG"

def find_endpoints(seq, sub):
    off = 0
    matches = []
    while True:
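        idx = seq.find(sub, off)  # search for the sub argument, not the global substring
        if idx == -1: break
        matches.append((idx, idx+len(sub)))
        off = idx + 1  # advance by one so overlapping matches are still found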
    return matches

for (s,e) in find_endpoints(sequence, substring):
    print(s, e, sequence[s:e])

Outputs:

(9, 17, 'GTTTGCAG')
(24, 32, 'GTTTGCAG')

Note: (s,e) are the start index (inclusive) and end index (exclusive) of the substring.
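
The question quotes the positions as 10-17 and 25-32, which correspond to these half-open spans counted from 1 with an inclusive end. A minimal sketch of that conversion, assuming a hypothetical helper named to_one_based:

```python
def to_one_based(spans):
    # Half-open 0-based (s, e) -> inclusive 1-based (s + 1, e).
    # The end value is unchanged: exclusive 0-based e equals inclusive 1-based e.
    return [(s + 1, e) for s, e in spans]

print(to_one_based([(9, 17), (24, 32)]))  # [(10, 17), (25, 32)]
```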
