How do regex engines account for irregularity?

Question

My C is a bit shaky, but I've looked at Python's source code, and it looks like most of Python's re module is implemented with state machines. This comes as no surprise, since regular expressions can be reduced to deterministic finite state machines.

I imagine other regex implementations are similar. But few, if any, modern regex implementations are regular according to the textbook definition. Then how do they account for irregularity, like backreferences?

(.*)\1   // this is not regular

Fred Foo · Accepted Answer · 2011-11-20T23:29:17.843

They use an extended (beyond finite-state) automaton class to account for this, and more complicated matching algorithms than the vanilla Thompson algorithm. You're very lucky if you find a formal characterization of the automaton class that any particular "RE" engine supports.

From what I can make out from the Python re source code, it stores the group in a buffer (it has to anyway, since it must return this as part of the match object) and does a straightforward string comparison in the matching algorithm, consuming as many characters as there are in the group match buffer.
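To make the idea concrete, here is a minimal sketch of that backreference step in Python. It is not the actual code from the re module: match_backref and match_square are hypothetical names, and the matcher is hard-wired to the single pattern (.*)\1 applied to a whole string without newlines. It only illustrates "compare against the captured buffer, then advance by its length", with backtracking over the possible group lengths.

```python
import re

def match_backref(text, pos, group_span):
    """Hypothetical helper: try to match a backreference at `pos`.

    `group_span` is the (start, end) of the already-captured group.
    Returns the position after the match, or None on failure.
    """
    start, end = group_span
    captured = text[start:end]
    # Plain string comparison against the captured buffer.
    if text.startswith(captured, pos):
        return pos + len(captured)
    return None

def match_square(text):
    """Hard-wired matcher for the question's pattern over the whole string.

    Tries the longest candidate for group 1 first (as a greedy `.*` would)
    and backtracks to shorter ones until the backreference fits.
    """
    for end in range(len(text), -1, -1):
        new_pos = match_backref(text, end, (0, end))
        if new_pos == len(text):
            return text[:end]
    return None

print(match_square("abcabc"))                        # abc
print(re.fullmatch(r"(.*)\1", "abcabc").group(1))    # abc
print(match_square("abcab"))                         # None
```

The loop over candidate group lengths is where the extra power (and the extra cost) comes from: the matcher has to remember an unbounded amount of text, which a finite-state machine cannot do.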

[Optional rant: unfortunately, RE engines in practice are collections of hacks on top of NFAs that destroy their mathematical properties. Many implementers ignore the elegant algebra of regular languages and their powerful extension to regular relations, which can be efficiently captured by FSTs.]
