Why is lookahead (sometimes) faster than capturing?

Question

This question is inspired by this other one.

Comparing `s/,(\d)/$1/` to `s/,(?=\d)//`: the former uses a capture group to put the digit back while dropping the comma, the latter uses a lookahead to check whether the comma is followed by a digit. Why is the latter sometimes faster, as discussed in this answer?

Benchmarking the two regexes, I cannot find any great difference. Both are very fast. Note that this applies to these particular regexes, not to capturing vs. lookahead in general. — TLP, Dec 03 '12 at 12:45
It's obvious: the capture group forces a copy of the data, and the replacement then needs to interpolate `$1`, while the second regex is just find/check/remove. However, the difference in speed should be barely noticeable. — Galimov Albert, Dec 03 '12 at 13:40

score 4 · Accepted Answer · answered Dec 03 '12 at 13:55

The two approaches do different things and have different kinds of overhead costs. When you capture, perl has to make a copy of the captured text. Look-ahead matches without consuming; it has to mark the location where it starts. You can see what's happening by using the re 'debug' pragma:

use re 'debug';
my $capture = qr/,(\d)/;

Compiling REx ",(\d)"
Final program:
   1: EXACT <,> (3)
   3: OPEN1 (5)
   5:   DIGIT (6)
   6: CLOSE1 (8)
   8: END (0)
anchored "," at 0 (checking anchored) minlen 2 
Freeing REx: ",(\d)"

use re 'debug';
my $lookahead = qr/,(?=\d)/;

Compiling REx ",(?=\d)"
Final program:
   1: EXACT <,> (3)
   3: IFMATCH[0] (8)
   5:   DIGIT (6)
   6:   SUCCEED (0)
   7: TAIL (8)
   8: END (0)
anchored "," at 0 (checking anchored) minlen 1 
Freeing REx: ",(?=\d)"

I'd expect look-ahead to be faster than capturing in most cases, but as noted in the other thread regex performance can be data dependent.
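The `minlen 2` versus `minlen 1` lines in the dumps above hint at how much text each pattern consumes. A minimal sketch to see this directly, using Perl's built-in `@-` and `@+` offset arrays; the helper name `match_span` and the sample string are made up for illustration:

```perl
use strict;
use warnings;

# Hypothetical helper: return the (start, end) offsets of the overall match,
# or an empty list if the pattern does not match.
sub match_span {
    my ($text, $re) = @_;
    return unless $text =~ $re;
    return ($-[0], $+[0]);
}

my $text = 'a,5';

my @capture   = match_span($text, qr/,(\d)/);    # consumes comma and digit
my @lookahead = match_span($text, qr/,(?=\d)/);  # consumes the comma only

printf "capture:   %d..%d\n", @capture;
printf "lookahead: %d..%d\n", @lookahead;
```

The capturing match spans two characters (comma and digit), so the digit has to be copied and written back by the replacement. The lookahead match spans only the comma, so the replacement is a plain deletion.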

I should have thought of the `re` pragma. Thanks! — mpe, Dec 03 '12 at 16:54

Matthias · Answer 2 · 2012-12-06T16:36:55.797

As always, when you want to know which of two pieces of code works faster, you have to test it:

#!/usr/bin/perl

use 5.012;
use warnings;
use Benchmark qw<cmpthese>;

say "Extreme ,,,:";
my $Text = ',' x (my $LEN = 512);
cmpthese my $TIME = -10, my $CMP = {
    capture => \&capture,
    lookahead => \&lookahead,
};

say "\nExtreme ,0,0,0:";
$Text = ',0' x $LEN;
cmpthese $TIME, $CMP;

my $P = 0.01;
say "\nMixed (@{[$P * 100]}% zeros):";
my $zeros = $LEN * $P;
$Text = ',' x ($LEN - $zeros) . ',0' x $zeros;
cmpthese $TIME, $CMP;

sub capture {
    local $_ = $Text;
    s/,(\d)/$1/;
}

sub lookahead {
    local $_ = $Text;
    s/,(?=\d)//;
}

The benchmark tests three different cases:

Only ','
Only ',0'
1% ',0', rest ','

On my machine and with my perl version, it produces these results:

Extreme ,,,:
             Rate   capture lookahead
capture   23157/s        --       -1%
lookahead 23362/s        1%        --

Extreme ,0,0,0:
               Rate   capture lookahead
capture    419476/s        --      -65%
lookahead 1200465/s      186%        --

Mixed (1% zeros):
             Rate   capture lookahead
capture   22013/s        --       -4%
lookahead 22919/s        4%        --

These results substantiate the assumption that the look-ahead version is significantly faster than the capturing one, except when the input consists almost entirely of commas. That is not very surprising, as PSIAlt already explained in his comment.
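A comparison like this is only fair if both substitutions leave the text in the same state. A minimal sketch of such a sanity check over the three input shapes used in the benchmark; the helper names `strip_capture` and `strip_lookahead` are hypothetical and not part of the benchmark script:

```perl
use strict;
use warnings;

# Hypothetical helpers: apply each substitution once (no /g, as in the
# benchmark) to a copy of the input and return the result.
sub strip_capture   { my ($s) = @_; $s =~ s/,(\d)/$1/; return $s }
sub strip_lookahead { my ($s) = @_; $s =~ s/,(?=\d)//; return $s }

my $len   = 512;
my $zeros = int($len * 0.01);
my %input = (
    'only commas' => ',' x $len,
    'only ,0'     => ',0' x $len,
    'mixed'       => ',' x ($len - $zeros) . ',0' x $zeros,
);

# Report whether both variants produce the same string for each input.
for my $name (sort keys %input) {
    my $same = strip_capture($input{$name}) eq strip_lookahead($input{$name});
    printf "%-12s %s\n", $name, $same ? 'same result' : 'DIFFERENT';
}
```

Because the pattern always starts with a comma, the digit consumed by the capturing version can never be the start of another match, so the two forms should also agree when `/g` is added.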

regards, Matthias

I know how to test this and in fact, have done so myself. Thanks for demonstrating, though; your results are very illustrative! — mpe, Dec 03 '12 at 16:56
