longest first
>>> p = re.compile('supermanutd|supermanu|superman|superm|super')
shortest first
>>> p = re.compile('super|superm|superman|supermanu|supermanutd')
Why is the longest first regex preferred?
longest first
>>> p = re.compile('supermanutd|supermanu|superman|superm|super')
shortest first
>>> p = re.compile('super|superm|superman|supermanu|supermanutd')
Why is the longest first regex preferred?
Alternatives in Regexes are tested in order you provide, so if first branch matches, then Rx doesn't check other branches. This doesn't matter if you only need to test for match, but if you want to extract text based on match, then it matters.
You only need to sort by length when your shorter strings are substrings of longer ones. For example when you have text:
supermanutd
supermanu
superman
superm
then with your first Rx you'll get:
>>> regex.findall(string)
[u'supermanutd', u'supermanu', u'superman', u'superm']
but with second Rx:
>>> regex.findall(string)
[u'super', u'super', u'super', u'super', u'super']
Test your regexes with http://www.pythonregex.com/
As @MBO says, alternatives are tested in the order they are written, and once one of them matches, the RE engine goes on to what comes after.
This behaviour is common to Perl-like RE engines, and ultimately goes back to the 1985 Bell Labs design of the RE library for Edition 8 Unix.
Note that POSIX 2 (from 1991) has another definition, insisting on the leftmost longest match for the whole RE and subject to that, for each subexpression in turn (in lexical order). In POSIX 2, order of alternatives does not matter.
However, the difference in behaviour is often: irrelevant (if you're just testing), masked by backtracking (if the shorter match causes the rest of the RE to fail), or compensated by the rest of the RE matching the part that the longer match 'should have' -- so most people aren't aware of it.
I'd guess it's because they're matched in that order, and it's faster to match shorter substrings. As an extreme example, a match against a single letter | a huge string will perform much better if the single letter (which is probably going to be responsible for the majority of matches anyway) is tested against first.
But in practice you should measure, not guess. If you need to have a performant regexp, test variations against representative test data.
The advice to which you refer is contingent on the regex engine attempting to match the components of the alternation in strictly left-to-right order, as is documented for the Python re module.
Sorting substrings in descending length order is just a special case of a wider problem when you are trying to extract a series of tokens. The general principle is that you put the more specialised sub-regexes first. For example, you are writing the lexical analysis for a formula parser. You have a "float constant" subregex and an "int constant" subregex. Your first attempt at the float subregex is likely to also match int constants. If so, you have two choices: (1) write a more complicated float subregex that doesn't match int constants (2) put your int subregex first.