What is the reason behind the advice that the substrings in regex should be ordered based on length?

Question

longest first

>>> p = re.compile('supermanutd|supermanu|superman|superm|super')

shortest first

>>> p = re.compile('super|superm|superman|supermanu|supermanutd')

Why is the longest first regex preferred?

score 5 · Accepted Answer · answered Apr 26 '11 at 08:25

Alternatives in a regex are tried in the order you write them, so once one branch matches, the engine does not check the remaining branches at that position. This doesn't matter if you only need to test whether the text matches, but it does matter if you want to extract text based on the match.

You only need to sort by length when your shorter strings are prefixes of longer ones (as 'super' is of 'superman'). For example, given the text:

supermanutd
supermanu
superman
superm

then with your first regex (longest first) you'll get:

>>> regex.findall(string)
[u'supermanutd', u'supermanu', u'superman', u'superm']

but with the second regex (shortest first):

>>> regex.findall(string)
[u'super', u'super', u'super', u'super']
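
The two results above can be reproduced with a short script. This is a minimal sketch for Python 3 (so the output has no u'' prefixes); the variable names `text`, `longest_first` and `shortest_first` are chosen here for illustration and are not from the question.

```python
import re

# The four sample lines from the answer, joined into one string.
text = "supermanutd\nsupermanu\nsuperman\nsuperm"

longest_first = re.compile(r"supermanutd|supermanu|superman|superm|super")
shortest_first = re.compile(r"super|superm|superman|supermanu|supermanutd")

# Longest first: each line is matched by the longest alternative that fits.
print(longest_first.findall(text))
# ['supermanutd', 'supermanu', 'superman', 'superm']

# Shortest first: 'super' always succeeds first, so nothing longer is tried.
print(shortest_first.findall(text))
# ['super', 'super', 'super', 'super']
```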

Test your regexes with http://www.pythonregex.com/
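
If the alternatives come from a list rather than being typed by hand, the longest-first order can be generated. The helper below is a sketch only; `build_alternation` is a hypothetical name, not part of the `re` module.

```python
import re

def build_alternation(words):
    """Compile an alternation that tries longer words before shorter ones.

    Hypothetical helper for illustration; re.escape keeps any regex
    metacharacters in the words literal.
    """
    ordered = sorted(set(words), key=len, reverse=True)
    return re.compile("|".join(re.escape(w) for w in ordered))

words = ["super", "superm", "superman", "supermanu", "supermanutd"]
pattern = build_alternation(words)

print(pattern.pattern)
# supermanutd|supermanu|superman|superm|super
print(pattern.findall("superman supermanutd super"))
# ['superman', 'supermanutd', 'super']
```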

answered Apr 26 '11 at 08:25

MBO


That's not the way regexp ORs work. You'd have a state machine with transitions for every character in the input string. When the input string is exhausted, the engine checks to see if it is in an accepting state and if so, returns a match. In this case, if your input string is superman, you *would* have to continue even after the initial super is matched. – Noufal Ibrahim Apr 26 '11 at 08:32

@Noufal: the exact description of the regex engine might be off, but the behaviour is exactly as described. – Joachim Sauer Apr 26 '11 at 09:20
@Noufal Ibrahim: TFM says: """As the target string is scanned, REs separated by '|' are tried from left to right. When one pattern completely matches, that branch is accepted. This means that once A matches, B will not be tested further, even if it would produce a longer overall match. In other words, the '|' operator is never greedy.""" Sure ain't no state machine behaves like that :-) – John Machin Apr 26 '11 at 12:57

score 2 · Answer 2 · answered Apr 26 '11 at 12:23

As @MBO says, alternatives are tested in the order they are written, and once one of them matches, the RE engine goes on to what comes after.
This behaviour is common to Perl-like RE engines, and ultimately goes back to the 1985 Bell Labs design of the RE library for Edition 8 Unix.
Note that POSIX 2 (from 1991) has another definition, insisting on the leftmost longest match for the whole RE and subject to that, for each subexpression in turn (in lexical order). In POSIX 2, order of alternatives does not matter.

However, the difference in behaviour is often irrelevant (if you're just testing for a match), masked by backtracking (if the shorter match causes the rest of the RE to fail), or compensated for by the rest of the RE matching the part that the longer alternative 'should have' matched, so most people aren't aware of it.
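
The masking and compensating cases can be seen with Python's `re` module. This is an illustrative sketch; the patterns and the names `m1` and `m2` are made up for the example.

```python
import re

# Masked by backtracking: 'super' matches first, but '$' then fails,
# so the engine backtracks and tries the next alternative, 'superman'.
m1 = re.match(r"(super|superman)$", "superman")
print(m1.group(1))  # superman

# Compensated by the rest of the RE: group 1 captures only 'super',
# yet \w* consumes the remainder, so the overall match is the whole word.
m2 = re.match(r"(super|superman)\w*", "supermanutd")
print(m2.group(0))  # supermanutd
print(m2.group(1))  # super
```

The order of alternatives only becomes visible in the second case if you look at the captured group rather than the overall match.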

score 0 · Answer 3 · answered Apr 26 '11 at 08:23

I'd guess it's because they're matched in that order, and it's faster to match shorter substrings. As an extreme example, a match against a single letter | a huge string will perform much better if the single letter (which is probably going to be responsible for the majority of matches anyway) is tested against first.

But in practice you should measure, not guess. If you need to have a performant regexp, test variations against representative test data.
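
One way to measure is the standard `timeit` module. The sketch below uses made-up sample data and patterns (all names are assumptions for illustration); the timings depend on your machine and data, so no expected numbers are shown.

```python
import re
import timeit

# Hypothetical sample data: mostly short tokens, a few long ones.
sample = " ".join(["a"] * 1000 + ["averyveryverylongstring"] * 10)

# \b on both sides makes the two orderings return the same matches,
# so only the speed can differ.
short_first = re.compile(r"\b(?:a|averyveryverylongstring)\b")
long_first = re.compile(r"\b(?:averyveryverylongstring|a)\b")

for name, pattern in [("short first", short_first), ("long first", long_first)]:
    seconds = timeit.timeit(lambda: pattern.findall(sample), number=200)
    print(f"{name}: {seconds:.4f}s")
```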

score 0 · Answer 4 · answered Apr 27 '11 at 01:18

The advice to which you refer is contingent on the regex engine attempting to match the components of the alternation in strictly left-to-right order, as is documented for the Python re module.

Sorting substrings in descending length order is just a special case of a wider problem that arises when you are trying to extract a series of tokens. The general principle is that you put the more specialised sub-regexes first. For example, suppose you are writing the lexical analysis for a formula parser. You have a "float constant" subregex and an "int constant" subregex. Your first attempt at the float subregex is likely to also match int constants. If so, you have two choices: (1) write a more complicated float subregex that doesn't match int constants, or (2) put your int subregex first.
