I am working on a script that finds approximate matches of a given pattern in a string and reports only the positions at which these matches (which may overlap) start.

So far, I have a script that reports the positions of exact matches, but I have had no success with approximate ones:

import re
stn = 'KLHLHLHKPLHLHLPHHKLHKLPKPH'
pat = 'KLH'
matches = re.finditer(r'(?=(%s))' % re.escape(pat), stn)
finalmatch= [m.start() for m in matches]
pos = ' '.join(str(v) for v in finalmatch)
print(pos)

The result in this case is: 0 17. But what if the script should also report approximate matches? That is, if the maximum permitted error (tolerance or threshold) is 1 (at any position of the query pattern), how can the start positions of HLH, PLH, KLP and KPH be reported?

I already tried to include a distance measure such as Levenshtein or SequenceMatcher, but with no success.

Thanks in advance for your help.

Andrés F

2 Answers

A basic way:

  • Group stn into consecutive, overlapping chunks of n chars, where n is len(pat)
  • Count how many chars are identical between each chunk and pat
  • Keep the start index of every chunk whose count is at least len(pat) - 1, i.e. at most one char differs

eg:

stn = 'KLHLHLHKPLHLHLPHHKLHKLPKPH'
pat = 'KLH'

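# Overlapping windows of len(pat) chars: ('K', 'L', 'H'), ('L', 'H', 'L'), ...
n_combos = zip(*[stn[n:] for n in range(len(pat))])
# For each window, the number of positions at which it agrees with pat
m_counts = (sum(1 for i, j in zip(el, pat) if i == j) for el in n_combos)
# Keep the start index of every window with at most one mismatch
indices = [idx for idx, val in enumerate(m_counts) if val >= len(pat) - 1]
# [0, 2, 4, 8, 10, 17, 20, 23]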
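
The same idea can be written as a small function with an adjustable tolerance. This is only a sketch: the name approx_starts and the max_mismatches parameter are made up for illustration, and it counts substitutions only (Hamming distance), so insertions and deletions as in Levenshtein distance are not handled.

```python
def approx_starts(text, pat, max_mismatches=1):
    """Start indices of windows of len(pat) that differ from pat
    in at most max_mismatches positions (substitutions only)."""
    n = len(pat)
    starts = []
    # Slide a window of len(pat) chars over text; windows may overlap
    for i in range(len(text) - n + 1):
        window = text[i:i + n]
        mismatches = sum(1 for a, b in zip(window, pat) if a != b)
        if mismatches <= max_mismatches:
            starts.append(i)
    return starts


stn = 'KLHLHLHKPLHLHLPHHKLHKLPKPH'
print(' '.join(str(i) for i in approx_starts(stn, 'KLH')))
# 0 2 4 8 10 17 20 23
```

With max_mismatches=0 this falls back to the exact matches from the question (0 and 17).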
Jon Clements
  • Thanks for your answer. If I need the most frequent patterns (of a certain length L) instead of the positions, could I replace the pat variable like this? L=3 max = 0 maxpatt = [] for p,f in m_counts.iteritems(): if f > max: maxpatt = [p] max = f elif f == max: maxpatt += [p] – Andrés F Nov 15 '13 at 00:44

Just change the pattern:

import re
from itertools import chain
stn = 'KLHLHLHKPLHLHLPHHKLHKLPKPH'
pats = ['KLH', 'KL', 'LH', 'K', 'L', 'H']
matches = []
for pat in pats:
    matches = chain(matches, re.finditer(r'(?=(%s))' % re.escape(pat), stn))
finalmatch = [m.start() for m in matches]
pos = ' '.join(str(v) for v in finalmatch)
print(pos)
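
If the goal is specifically "at most one substituted character", a different set of patterns may be closer to what the question asks: replace each position of the pattern with a wildcard in turn (.LH, K.H, KL.) and join the variants in one lookahead. This is a sketch of that alternative, not the code above; the function name one_mismatch_starts is made up for illustration.

```python
import re


def one_mismatch_starts(pat, text):
    """Start indices where text matches pat with at most one substitution."""
    # One alternative per position, with that position replaced by '.'
    variants = [
        re.escape(pat[:i]) + '.' + re.escape(pat[i + 1:])
        for i in range(len(pat))
    ]
    # The lookahead is zero-width, so overlapping hits are all reported;
    # re.DOTALL lets '.' stand in for a newline as well
    regex = re.compile('(?=(?:%s))' % '|'.join(variants), re.DOTALL)
    return [m.start() for m in regex.finditer(text)]


stn = 'KLHLHLHKPLHLHLPHHKLHKLPKPH'
print(' '.join(str(i) for i in one_mismatch_starts('KLH', stn)))
# 0 2 4 8 10 17 20 23
```

Exact matches are included because every wildcard variant also matches the original pattern.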
Steinar Lima
  • I understand the idea, but the script raises an error because of the incomplete substrings like 'LH (invalid syntax). I do not know how to resolve the problem without complicating it too much, as would be the case with using a matrix or something of the sort. – Andrés F Nov 15 '13 at 00:26
  • I just realized that you are using `finditer()`, which means that you need to use `chain()` instead of list extension. Maybe this can help? If not, can you provide the traceback you get when you run the code? – Steinar Lima Nov 15 '13 at 00:30