How to implement an NFA- or DFA-based regexp matching algorithm to find all matches?

Question

The matches may overlap.

But if multiple matches start at the same position, pick the shortest one.

For example, searching for the regexp pattern "a.*d" in the string "abcabdcd" should yield {"abcabd", "abd"}. The longer candidates "abcabdcd" and "abdcd" should not be included.

Standard RE algorithms are greedy by default, meaning that quantifiers try to match as much as possible, so `a.*d` would match the whole of `abcabdcd`. You therefore _need_ a non-greedy match strategy (i.e., the match is reported the first time you enter an accepting state). – Donal Fellows, Feb 20 '12 at 09:37
Actually, the restriction of picking the shortest match may be loosened or dropped. What I really want is time efficiency, so we may need to change the internal implementation of the regexp engine (NFA or DFA). – Rubbish_Oh, Feb 20 '12 at 10:12
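
To make the "report a match the first time you enter an accepting state" idea concrete, here is a minimal sketch of a hand-built automaton for the single pattern `a.*d` only. It is not a general regexp engine, and the function name `shortest_matches_a_dotstar_d` is made up for illustration. Each "thread" is just the start index of an `a` that has already been consumed; a thread is reported and retired as soon as it sees a `d`.

```python
def shortest_matches_a_dotstar_d(text):
    """One left-to-right pass for the fixed pattern a.*d (illustrative only)."""
    active_starts = []  # start indices of threads that have already consumed 'a'
    matches = []
    for i, ch in enumerate(text):
        if ch == "\n":
            # '.' does not match a newline by default, so every thread dies.
            active_starts.clear()
        elif ch == "d":
            # All live threads reach the accepting state here. Report each one
            # and retire it, so only the shortest match per start is kept.
            matches.extend(text[s:i + 1] for s in active_starts)
            active_starts.clear()
        elif ch == "a":
            # A new thread can start at every 'a'.
            active_starts.append(i)
    return matches


print(shortest_matches_a_dotstar_d("abcabdcd"))  # ['abcabd', 'abd']
```

The scan itself visits each character once, but the total size of the output can still grow quadratically, for example when many `a` characters precede a single `d`.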

hochl · Answer 1 · 2012-02-20T09:34:22.050

This function is rather inefficient, but it solves your problem:

import re

def find_shortest_overlapping_matches(pattern, line):
        pat = re.compile(pattern)
        n = len(line)
        ret = []
        for start in range(n):
                # Grow the candidate substring until the pattern first matches it.
                for end in range(start + 1, n + 1):
                        tmp = line[start:end]
                        mat = pat.match(tmp)
                        if mat is not None:
                                ret.append(tmp)
                                break  # keep only the shortest match for this start
        return ret

print(find_shortest_overlapping_matches("a.*d", "abcabdcd"))

Output:

['abcabd', 'abd']

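The ranges assume that your pattern consumes at least one character and never matches the empty string. You should also consider using non-greedy quantifiers (for example `.*?` instead of `.*`), so that a single match attempt per start position already stops at the short match. That improves performance and removes the need for the inner loop.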
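
As a rough sketch of that suggestion, the inner loop can be dropped when the caller supplies a pattern that already uses non-greedy quantifiers. The function name `shortest_matches_lazy` is hypothetical, and the sketch assumes Python's `re` module. For a simple pattern such as `a.*?d` the first match found at a position is also the shortest one; for patterns with alternation that is not guaranteed.

```python
import re


def shortest_matches_lazy(lazy_pattern, line):
    """Try one anchored match per start position (assumes a non-greedy pattern)."""
    pat = re.compile(lazy_pattern)
    matches = []
    for start in range(len(line)):
        # Pattern.match(string, pos) only succeeds if a match begins at pos.
        m = pat.match(line, start)
        if m is not None and m.end() > start:  # skip empty matches
            matches.append(m.group(0))
    return matches


print(shortest_matches_lazy(r"a.*?d", "abcabdcd"))  # ['abcabd', 'abd']
```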

Your function is the approach I had already thought of, but what I want is a more efficient algorithm. Thank you all the same. – Rubbish_Oh, Feb 20 '12 at 10:00

score 0 · Accepted Answer · answered Feb 20 '12 at 09:47

Most RE engines only match an RE once and greedily by default, and standard iteration strategies built around them tend to restart the search after the end of the previous match. To do other than that requires some extra trickery. (This code is Tcl, but you should be able to replicate it in many other languages.)

proc matchAllOverlapping {RE string} {
    set matches {}
    set nonGreedyRE "(?:${RE}){1,1}?"
    set idx 0
    while {[regexp -indices -start $idx $nonGreedyRE $string matchRange]} {
        lappend matches [string range $string {*}$matchRange]
        set idx [expr { [lindex $matchRange 0] + 1 }]
    }
    return $matches
}
puts [matchAllOverlapping a.*d abcabdcd]
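
For readers who do not use Tcl, the same loop might look roughly like this in Python. This is a sketch, not a line-for-line port: the name `match_all_overlapping` is made up, and the Tcl wrapper `(?:...){1,1}?` is not carried over because it relies on Tcl's regexp engine. Here the caller is assumed to pass a pattern that already uses non-greedy quantifiers.

```python
import re


def match_all_overlapping(lazy_pattern, text):
    """Repeatedly search, restarting one character after each match start."""
    pat = re.compile(lazy_pattern)
    matches = []
    idx = 0
    while idx <= len(text):
        # Pattern.search(string, pos) finds the first match at or after pos.
        m = pat.search(text, idx)
        if m is None:
            break
        matches.append(m.group(0))
        # The next eligible match must start at least one character later.
        idx = m.start() + 1
    return matches


print(match_all_overlapping(r"a.*?d", "abcabdcd"))  # ['abcabd', 'abd']
```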

Donal Fellows


Alas, this is one of the times when I miss having an option to just do the whole compilation in “non-greedy by default” mode without resorting to trickery with non-greedy wrapping. Haven't got that though… – Donal Fellows Feb 20 '12 at 09:50
Actually, I can't follow the Tcl code. Are you still using the standard iteration strategy, or something else? If it is something else, can you explain the idea? – Rubbish_Oh Feb 20 '12 at 10:06
First, I convert the RE to non-greedy. Then I repeatedly find the first match from the “current” index (initially the start of the string) accumulating the matched substrings and setting the current index to one character _after_ the start of the previously-found match. Once nothing matches, I've got the list of matched substrings I wanted. – Donal Fellows Feb 20 '12 at 11:18
@Rubbish_Oh All algorithms must be at least O(n²) (assuming that RE matching is linear, which it can be for simple cases like yours; forcing to non-greedy helps) but there's no reason to use the horrendous O(n³) approach of hochl's answer. – Donal Fellows Feb 20 '12 at 11:27
If you "set the current index to one character after the start of the previously-found match", where do the overlapping matches come from? – Rubbish_Oh Feb 20 '12 at 11:55
@Rubbish_Oh The “current” index (i.e., the `idx` variable) is the point at which the RE engine will commence searching. We know that the place where the RE matched the previous time round gives a particular match — we've just found it after all — so we know that the next eligible match point must start _at least_ one character later, and hence that's the place where we start searching from the next time round. – Donal Fellows Feb 20 '12 at 13:29
If you wanted **non**-overlapping matches, Tcl can find those as a one-liner. Many other languages have equally short idioms for that. – Donal Fellows Feb 20 '12 at 13:30
