Perl Regex Multiple Matches

Question

I'm looking for a regular expression that will behave as follows:

input: "hello world."

output: he, el, ll, lo, wo, or, rl, ld

My idea was something along the lines of

    while($string =~ m/(([a-zA-Z])([a-zA-Z]))/g) {
        print "$1-$2 ";
    }

But that does something a little bit different.

Nice question. I think I may have answered it before. [Search](http://stackoverflow.com/search?q=code%3A%22%28*FAIL%29%22) for `(*FAIL)`. — tchrist, Mar 07 '13 at 18:54

tchrist · Accepted Answer · 2013-03-07T19:14:07.257

10

It's tricky. You have to capture it, save it, and then force a backtrack.

You can do that this way:

    use v5.10;   # first release with backtracking control verbs

    my $string = "hello, world!";
    my @saved;

    my $pat = qr{
        ( \pL {2} )
        (?{ push @saved, $^N })
        (*FAIL)
    }x;

    @saved = ();
    $string =~ $pat;
    my $count = @saved;
    printf "Found %d matches: %s.\n", $count, join(", " => @saved);

produces this:

    Found 8 matches: he, el, ll, lo, wo, or, rl, ld.

If you do not have v5.10, or you have a headache, you can use this:

    my $string = "hello, world!";
    my @pairs = $string =~ m{
      # we can only match at positions where the
      # following sneak-ahead assertion is true:
        (?=                 # zero-width look ahead
            (               # begin stealth capture
                \pL {2}     #       save off two letters
            )               # end stealth capture
        )
      # succeed after matching nothing, force reset
    }xg;

    my $count = @pairs;
    printf "Found %d matches: %s.\n", $count, join(", " => @pairs);

That produces the same output as before.
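
To see why the lookahead version walks through the string one offset at a time, it can help to record `pos` at each match. This is only a sketch: the `@trace` array is a name introduced here for illustration, and the pattern is the same lookahead written without `/x`.

```perl
use strict;
use warnings;

my $string = "hello, world!";
my @trace;

# The lookahead matches zero characters, so pos() stays at the offset
# where the pair begins. On the next iteration the engine will not
# repeat an empty match at the same offset and moves ahead instead.
while ($string =~ /(?=(\pL{2}))/g) {
    push @trace, pos($string) . ":$1";
}

print join(" ", @trace), "\n";
# prints: 0:he 1:el 2:ll 3:lo 7:wo 8:or 9:rl 10:ld
```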

But you might still have a headache.

edited Mar 07 '13 at 19:14

answered Mar 07 '13 at 18:54

tchrist


Nice answer. Can you please explain the answer in brief? – Krishnachandra Sharma Mar 07 '13 at 18:56
@KrishnachandraSharma Hm, I thought I had. You capture it, save it in an external variable, then force a failure. The failure backtracks, the engine steps forward one because it made no progress, and we try again. Lather, rinse, repeat. – tchrist Mar 07 '13 at 18:58
Great approach and answer, but this would be a problem in versions of Perl older than 5.10. – Krishnachandra Sharma Mar 07 '13 at 19:01
Thanks, thought there would be a more straightforward way, but it gets the job done. – Johann Mar 07 '13 at 19:12
Always use `local our @var` instead of `my @var` for vars declared outside of `(?{ })` but used inside. Your code won't work inside a sub. – ikegami Mar 07 '13 at 19:15
@KrishnachandraSharma Perl v5.10 came out in 2007. Do you know how many children have been born since then? – tchrist Mar 07 '13 at 19:15
@tchrist Still, lower versions are not obsolete, right? I simply made my point! – Krishnachandra Sharma Mar 07 '13 at 19:18
@ikegami You're probably right, but I don't get any will-not-stay-shared type warnings. Care to explain? Is this related to the trick of using `local` *inside* the backtracking part so it pops things off? I don't want that to happen here. – tchrist Mar 07 '13 at 19:18
`(?{ })` captures its variables when the regex is compiled, which is at compile time. That makes it behave like a named sub. There's really no problem until you move the code into a sub, but when you do, you'll get `Variable "@saved" is not available`. – ikegami Mar 07 '13 at 19:21
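
The `local our` suggestion from the comments can be sketched as follows. This is an illustration under assumptions, not the answer's own code: `letter_pairs` is a hypothetical helper name, and the body simply moves the `(*FAIL)` pattern into a sub with `@saved` declared as a localized package variable.

```perl
use v5.10;
use strict;
use warnings;

# letter_pairs() is a hypothetical helper name, not part of the answer.
sub letter_pairs {
    my ($string) = @_;

    # A package variable, localized for this call, as suggested in the
    # comments for variables that are used inside (?{ }).
    local our @saved = ();

    $string =~ m{
        ( \pL{2} )                 # grab two letters
        (?{ push @saved, $^N })    # record the group that just closed
        (*FAIL)                    # reject, so the engine tries the next offset
    }x;

    my @pairs = @saved;            # copy before the localized value is restored
    return @pairs;
}

say join ", ", letter_pairs("hello, world!");
# prints: he, el, ll, lo, wo, or, rl, ld
```

Because `@saved` is reset on every call, calling the sub repeatedly should not accumulate pairs from earlier strings.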

ikegami · Answer 2 · 2013-03-07T19:24:11.547

5

No need "to force backtracking"!

    push @pairs, "$1$2" while /([a-zA-Z])(?=([a-zA-Z]))/g;

Though you might want to match any letter rather than the limited set you specified.

    push @pairs, "$1$2" while /(\pL)(?=(\pL))/g;
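
For anyone who wants to run the one-liner as a standalone script, here is a minimal sketch. It assumes the question's input string and matches against a named variable (`$string`, introduced here) rather than `$_`.

```perl
use strict;
use warnings;

my $string = "hello world.";
my @pairs;

# (\pL) consumes one letter; (?=(\pL)) peeks at the next letter and
# captures it without consuming it, so it can start the following pair.
push @pairs, "$1$2" while $string =~ /(\pL)(?=(\pL))/g;

print join(", ", @pairs), "\n";
# prints: he, el, ll, lo, wo, or, rl, ld
```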

edited Mar 07 '13 at 19:24

answered Mar 07 '13 at 19:13

ikegami


Not as efficient, for starters. – ikegami Mar 07 '13 at 19:17
That has the dreaded "`[A-Z]` code smell", you know. Good for avoiding jalapeños, bad for normal text. – tchrist Mar 07 '13 at 19:21
@tchrist, The OP never said anything about letters, but I added a mention of `\pL`. – ikegami Mar 07 '13 at 19:25
I always use `\pL` unless I'm working from an RFC that stipulates A-Z exclusively. I wish I remembered what the `(*FAIL)` is good for beyond the lookahead technique. – tchrist Mar 07 '13 at 19:58
@tchrist, Used in conditionals too `(?(?{ condition() })(?!))` – ikegami Mar 08 '13 at 07:03

score 1 · Answer 3 · answered Mar 07 '13 at 19:43

Yet another way to do it. This doesn't use any regexp magic. It does use nested maps, but these could easily be translated to for loops if desired.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    my $in = "hello world.";
    my @words = $in =~ /(\b\pL+\b)/g;

    my @out = map {
      my @chars = split '';
      map { $chars[$_] . $chars[$_+1] } ( 0 .. $#chars - 1 );
    } @words;

    print join ',', @out;
    print "\n";

Again, for me this is more readable than a strange regex, YMMV.
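
The translation to for loops mentioned above might look like the sketch below. The loop variables `$word` and `$i` are names chosen here, and the word-splitting pattern is simplified to `\pL+`, which should find the same runs of letters.

```perl
use strict;
use warnings;

my $in = "hello world.";
my @out;

# Outer loop: each run of letters. Inner loop: each adjacent pair in it.
for my $word ($in =~ /(\pL+)/g) {
    my @chars = split //, $word;
    for my $i (0 .. $#chars - 1) {
        push @out, $chars[$i] . $chars[$i + 1];
    }
}

print join(',', @out), "\n";
# prints: he,el,ll,lo,wo,or,rl,ld
```

A one-letter word yields an empty range for `$i`, so it contributes no pairs.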

answered Mar 07 '13 at 19:43

Joel Berger


The more of these I read, the more I think ikegami's does what I attempt here with even more clarity. – Joel Berger Mar 07 '13 at 19:59

Anirudha · Answer 4 · 2013-03-07T19:29:51.770

0

I would use a capture group inside a lookahead:

    (?=([a-zA-Z]{2}))
        ------------
             |->group 1 captures two English letters

edited Mar 07 '13 at 19:29

answered Mar 07 '13 at 18:58

Anirudha


Works nicely. Great answer. – Krishnachandra Sharma Mar 07 '13 at 19:09
What is an "alphabet"? Greek is an alphabet. Latin is an alphabet. Cyrillic is an alphabet. That does not capture "alphabets"; that captures two upper- or lowercase letters between A and Z inclusive. It fails on things like *façade*. – tchrist Mar 07 '13 at 19:20
@tchrist op never mentioned that..a-zA-Z are still alphabets – Anirudha Mar 07 '13 at 19:22
@Some1.Kill.The.DJ Um, that is not what *alphabet* means in English. I suggest you check a dictionary. – tchrist Mar 07 '13 at 19:23
@tchrist hope **English alphabets** makes it more clear to you :P – Anirudha Mar 07 '13 at 19:30
@Some1.Kill.The.DJ Aren't you listening? That is not what alphabet means. PLEASE see a dictionary. The word *alphabet* is not a synonym for "an alphabetic character". – tchrist Mar 07 '13 at 19:31
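
The practical difference between `[a-zA-Z]` and `\pL` raised in these comments can be checked with the *façade* example. This is a small sketch, assuming the word is written with an escape (`\x{E7}` for ç) so the script does not depend on the file's encoding; the variable names are introduced here for illustration.

```perl
use strict;
use warnings;

# "fa\x{E7}ade" is the word façade; \x{E7} is the letter ç.
my $word = "fa\x{E7}ade";

my @ascii_pairs   = $word =~ /(?=([a-zA-Z]{2}))/g;
my @unicode_pairs = $word =~ /(?=(\pL{2}))/g;

# Report counts only, so no output encoding layer is needed.
printf "[a-zA-Z] finds %d pairs, \\pL finds %d pairs\n",
    scalar @ascii_pairs, scalar @unicode_pairs;
# prints: [a-zA-Z] finds 3 pairs, \pL finds 5 pairs
```

The ASCII class skips every pair that touches the ç, while `\pL` treats it as a letter like any other.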

Schwern · Answer 5 · 2013-03-07T22:24:24.283

You can do this by looking for letters and using the pos function to make use of the position of the capture, \G to reference it in another regex, and substr to read a few characters from the string.

    use v5.10;
    use strict;
    use warnings;

    my $letter_re = qr/[a-zA-Z]/;

    my $string = "hello world.";
    while( $string =~ m{ ($letter_re) }gx ) {
        # Skip it if the next character isn't a letter
        # \G will match where the last m//g left off.
        # It's pos() in a regex.
        next unless $string =~ /\G $letter_re /x;

        # pos() is still where the last m//g left off.
        # Use substr to print the character before it (the one we matched)
        # and the next one, which we know to be a letter.
        say substr $string, pos($string)-1, 2;
    }

You can put the "check the next letter" logic inside the original regex with a zero-width positive assertion, (?=pattern). Zero-width means the assertion consumes no characters: what it matches is not part of the match and does not advance the position of an m//g regex. This is a bit more compact, but zero-width assertions can get tricky.

    while( $string =~ m{ ($letter_re) (?=$letter_re) }gx ) {
        # pos() is still where the last m//g left off.
        # Use substr to print the character before it (the one we matched)
        # and the next one, which we know to be a letter.
        say substr $string, pos($string)-1, 2;
    }

UPDATE: I'd originally tried to capture both the match and the look-ahead as m{ ($letter_re (?=$letter_re)) }gx, but that didn't work. The look-ahead is zero-width and slips out of the match. Other answers showed that if you put a second capture inside the look-ahead, then it can collapse to just...

    say "$1$2" while $string =~ m{ ($letter_re) (?=($letter_re)) }gx;
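
The difference between capturing around the look-ahead and capturing inside it can be seen side by side. This is a sketch for illustration; `@around` and `@inside` are names introduced here, not part of the answer.

```perl
use strict;
use warnings;

my $letter_re = qr/[a-zA-Z]/;
my $string    = "hello world.";

# Capture around the look-ahead: the group holds only the consumed letter.
my @around = $string =~ m{ ($letter_re (?=$letter_re)) }gx;

# Capture inside the look-ahead: $2 holds the letter that was peeked at.
my @inside;
push @inside, "$1$2" while $string =~ m{ ($letter_re) (?=($letter_re)) }gx;

print join(",", @around), "\n";   # prints: h,e,l,l,w,o,r,l
print join(",", @inside), "\n";   # prints: he,el,ll,lo,wo,or,rl,ld
```

In the first form the look-ahead still filters the matches (only letters followed by a letter qualify), but the second letter never becomes part of the captured text.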

I leave all the answers here for TMTOWTDI, especially if you're not a regex master.

Do you really use `substr` and `pos`? Especially like this? I nearly never ever do. Part of it is because it is grapheme-hostile, but mostly it is because it's fiddly. — tchrist, Mar 07 '13 at 19:57
@tchrist In this case, it seemed the simplest way to do it. `pos` and `substr` have the benefit of being straight forward to understand vs a regex with look-aheads and captures. I agree `pos` can change out from under you mysteriously, but small scopes forgive many sins. My first impulse was a non-regex iterative solution, since "match every pair of letters" is pretty simple if you loop through characters. I tried to do it with a capture + look-ahead, but couldn't make it work; didn't realize you need to put a second capture inside the look-ahead, not around it. — Schwern, Mar 07 '13 at 22:20
