upgraded from perl 5.8 (32bit) to 5.16 (64bit) - regex performance hit

Question

I'm running a series of regexes against blocks of data. We recently upgraded from ActiveState Perl 5.8 32-bit (I know... extremely old!) to Perl 5.16 64-bit. All the hardware stayed the same (Windows).

We are noticing a performance hit: our parse loop used to take about 2.5 seconds, and now it takes about 5 seconds. Can anybody give me a hint as to what would cause the change? I was expecting an increase in performance, as my understanding was that the engine had improved greatly. Any docs on what I should be doing differently would be greatly appreciated.

time just reading the data without doing any processing, and see how much of the difference is from that — ysth, Jul 23 '13 at 01:38
I've seen some questions/comments about regex in "newer" versions of Perl being slower due to implementing unicode more correctly than older versions of Perl (or something). Don't know how much truth there is in that but maybe you could look into that. — Qtax, Jul 23 '13 at 10:05

score 7 · Accepted Answer · answered Jul 23 '13 at 10:56

Yes, the regex engine improved greatly after v8. In v10 alone, we saw:

- pattern recursion
- named captures
- possessive quantifiers
- backtrack control verbs like (*FAIL) or (*SKIP)
- the \K operator
- … and some more
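
A minimal sketch of three of these features, assuming Perl 5.10 or later. The sample strings and variable names are made up for illustration and are not taken from the question:

```perl
use strict;
use warnings;
use 5.010;

# Named captures: results land in %+ instead of $1, $2, ...
my $line = 'user=alice id=42';   # hypothetical sample input
my ($user, $id);
if ($line =~ /user=(?<user>\w+)\s+id=(?<id>\d+)/) {
    ($user, $id) = ($+{user}, $+{id});
}

# \K: everything matched before \K is kept out of the replaced text,
# so only the final path component is substituted.
(my $path = '/var/log/app.log') =~ s{.*/\K[^/]+\z}{other.log};

# Possessive quantifier: \d++ never gives digits back, so this match
# fails right away instead of backtracking to find the trailing 5.
my $possessive_matched = ('12345' =~ /\A\d++5\z/) ? 1 : 0;

say "$user $id $path $possessive_matched";
```

The possessive example deliberately fails to match: with a plain \d+ the engine would backtrack one digit and succeed, which is exactly the work the possessive form refuses to do.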

Also, more internals were made Unicode-aware.

In v12, the Unicode support was cleaned up. The \p and \X operators in regexes were greatly enhanced.

In v14, Unicode support was updated to Unicode 6.0. Charnames for the \N operator were improved (see also the charnames pragma). The new character model can treat any unsigned integer as a codepoint. In the regex engine:

- Regexes can now carry character-class modifiers such as /u, /d, /l, /a and /aa.
- Non-destructive substitution with /r was implemented.
- The regex engine is now reentrant, so embedded code can use regexes.
- \p was cleaned up.
- Regex compilation is faster when a switch to Unicode semantics is necessary.
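
To illustrate two of these, here is a small sketch assuming Perl 5.14 or later. The strings are invented for the example:

```perl
use strict;
use warnings;
use 5.014;

# /r returns the modified copy and leaves the original untouched.
my $original = 'perl 5.8';
my $updated  = $original =~ s/5\.8/5.16/r;

# Under Unicode rules \d matches any Unicode digit; /a restricts
# \d, \s, \w (and the POSIX classes) to ASCII.
my $arabic_indic_three = "\x{0663}";   # ARABIC-INDIC DIGIT THREE
my $unicode_digit = ($arabic_indic_three =~ /\A\d\z/)  ? 1 : 0;
my $ascii_digit   = ($arabic_indic_three =~ /\A\d\z/a) ? 1 : 0;

say "$original -> $updated; \\d: $unicode_digit, \\d with /a: $ascii_digit";
```

If your data is known to be ASCII-only, /a documents that intent in the pattern itself. Whether it changes run time for your patterns is something you would have to measure.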

In v16, perl almost supports Unicode 6.1. In the regex engine:

- The efficiency of \p character classes was increased.
- Various regex bugs (often involving case-insensitive matching) were fixed.

Not all of these features come at a price, but Unicode awareness in particular makes the internals more complicated, and slower.

You also cannot wave a hand and state that the execution time of a script doubled because of the move from perl5 v8 x86 to perl5 v16 x64; there are too many variables:

- Were both perls compiled with the same flags?
- Are both perls threaded perls? (Disabling threading support makes perl faster.)
- How big are your integers? 64 bit or 32 bit?
- What compiler optimizations were chosen?
- Did your previous perl have some distribution-specific patches applied?

Basically, you have to compare the whole perl -V output.
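
As a rough sketch of that comparison, the core Config module exposes the same values that perl -V prints. The selection of keys below is only my guess at what matters most for speed and is not exhaustive. The code avoids 5.10+ syntax so that it should also run under the old 5.8 install:

```perl
use strict;
use warnings;
use Config;   # core module exposing the values shown by `perl -V`

# Build settings that seem most likely to affect speed (an assumed,
# non-exhaustive selection). Run this under both perls and diff the output.
my @keys = qw(version archname usethreads useithreads ivsize ptrsize
              cc optimize ccflags);

# Returns one formatted "key value" line per setting.
sub build_summary {
    return map {
        sprintf '%-12s %s', $_,
            defined $Config{$_} ? $Config{$_} : '(undef)'
    } @keys;
}

print "$_\n" for build_summary();
```

Saving the output from each installation to a file and diffing the two files gives a quick first look before reading the full perl -V dumps.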

If you are hitting a performance ceiling with regexes, they may be the wrong tool for extensive parsing. At the very least, you may use the newer features to optimize the regexes to eliminate some backtracking.

If your parsing code describes a (roughly) context-free language (i.e. you don't use (?{...}), (?=...) or related regex features), and parsing means doing something like generating a tree, then Marpa::R2 might speed things up considerably.

Thanks amon. This is a great summary of some items we can examine, I really appreciate you taking the time to write it out! — sniperd, Jul 29 '13 at 14:30

score 0 · Answer 2 · answered Jul 23 '13 at 23:31

If you are looking for better performance you may also want to make sure that a regex is what you want. You didn't specify what kind of regexes your system was using but often you can replace a regex with a built-in function.

Examples:

if (lc($name) eq 'bob') { $bob_count++ }  #Faster
if ($name =~ /^bob$/i)  { $bob_count++ }  #Slower

my $sentiment = "I don't like beans.";
substr($sentiment, 13, 5) = 'broccoli';   #Faster
$sentiment = "I don't like beans.";
$sentiment =~ s/beans/broccoli/;          #Slower

These examples, as well as unpack and index, might not apply to your code, but if they do, you should benchmark them and see whether they help with performance.
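
One way to run such a comparison is the core Benchmark module. This is a minimal sketch with a made-up sample value for $name; results depend on your perl build and your data, so no numbers are shown here:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

my $name = 'Bob';   # hypothetical sample value

# Both subs answer "is this name 'bob', ignoring case?"
my %candidates = (
    lc_eq => sub { return lc($name) eq 'bob' ? 1 : 0 },
    regex => sub { return $name =~ /\Abob\z/i ? 1 : 0 },
);

# Run each candidate for at least 1 CPU second and print a comparison table.
cmpthese(-1, \%candidates);
```

Run the same script under both perl installations with input that resembles your real data, since a toy string like this one may not show the same ratio.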

answered by dms

I understood the question to be about performance degradation after a perl upgrade. Your answer does not mention anything that could have changed between versions. Instead, you provide general optimization tips with little connection to the question. (Your substitution example isn't even equivalent unless the contents of `$sentiment` are known before execution. You probably meant `if (0 <= (my $i = index $sentiment, "beans")) { substr $sentiment, $i, length "beans", "broccoli" }`) – amon Jul 23 '13 at 23:43
I read between the lines a bit to see that no matter what the reason for the slowdown, the question asker will presumably still have a performance problem. My answer pointed out that sometimes performance problems can be solved by eliminating regexes. That is directly connected to the question. I tried to make it clear that these examples would have to be tested on the specific code in question. So, yes, it is unlikely that his code includes a variable that contains the string "I don't like beans.". I thought that was too obvious to mention. – dms Jul 24 '13 at 02:56
