5

I'm looking for a regular expression that will behave as follows:

input: "hello world."

output: he, el, ll, lo, wo, or, rl, ld

My idea was something along the lines of:

    while($string =~ m/(([a-zA-Z])([a-zA-Z]))/g) {
        print "$1-$2 ";
    }

But that does something a little bit different.

Scott Berrevoets
Johann
  • Nice question. I think I may have answered it before. [Search](http://stackoverflow.com/search?q=code%3A%22%28*FAIL%29%22) for `(*FAIL)`. – tchrist Mar 07 '13 at 18:54

5 Answers

10

It's tricky. You have to capture it, save it, and then force a backtrack.

You can do that this way:

    use v5.10;   # first release with backtracking control verbs

    my $string = "hello, world!";
    my @saved;

    my $pat = qr{
        ( \pL {2} )
        (?{ push @saved, $^N })
        (*FAIL)
    }x;

    @saved = ();
    $string =~ $pat;
    my $count = @saved;
    printf "Found %d matches: %s.\n", $count, join(", " => @saved);

produces this:

    Found 8 matches: he, el, ll, lo, wo, or, rl, ld.

If you do not have v5.10, or you have a headache, you can use this:

    my $string = "hello, world!";
    my @pairs = $string =~ m{
      # we can only match at positions where the
      # following sneak-ahead assertion is true:
        (?=                 # zero-width look ahead
            (               # begin stealth capture
                \pL {2}     #       save off two letters
            )               # end stealth capture
        )
      # succeed after matching nothing, force reset
    }xg;

    my $count = @pairs;
    printf "Found %d matches: %s.\n", $count, join(", " => @pairs);

That produces the same output as before.

But you might still have a headache.

tchrist
  • Nice answer. Can you please explain the answer in brief? – Krishnachandra Sharma Mar 07 '13 at 18:56
  • @KrishnachandraSharma Hm, I thought I had. You capture it, save it in an external variable, then force a failure. The failure backtracks, the engine steps forward one because it made no progress, and we try again. Lather, rinse, repeat. – tchrist Mar 07 '13 at 18:58
  • Great approach and answer, but this would be a problem in versions of Perl older than 5.10. – Krishnachandra Sharma Mar 07 '13 at 19:01
  • Thanks, I thought there would be a more straightforward way, but it gets the job done. – Johann Mar 07 '13 at 19:12
  • Always use `local our @var` instead of `my @var` for vars declared outside of `(?{ })` but used inside. Your code won't work inside a sub. – ikegami Mar 07 '13 at 19:15
  • @KrishnachandraSharma Perl v5.10 came out in 2007. Do you know how many children have been born since then? – tchrist Mar 07 '13 at 19:15
  • @tchrist Still, lower versions are not obsolete, right? I simply made my point! – Krishnachandra Sharma Mar 07 '13 at 19:18
  • @ikegami You're probably right, but I don't get any will-not-stay-shared type warnings. Care to explain? Is this related to the trick of using `local` *inside* the backtracking part so it pops things off? I don't want that to happen here. – tchrist Mar 07 '13 at 19:18
  • `(?{ })` captures its variables when the regex is compiled, which is at compile time. So that makes it like a named sub. There's really no problem until you move it into a sub, but when you do, you'll get `Variable "@saved" is not available` – ikegami Mar 07 '13 at 19:21
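
The `local our` advice from the comments can be sketched as follows. This is only an illustration: the sub name `letter_pairs` is hypothetical and not from the answer, and the pattern is the same capture, save, fail idea wrapped in a sub.

```perl
use v5.10;
use strict;
use warnings;

# letter_pairs() is a hypothetical helper name, not from the answer.
sub letter_pairs {
    my ($string) = @_;

    # A package variable, localized for this call, instead of a lexical "my".
    local our @saved = ();

    $string =~ m{
        ( \pL{2} )                  # capture two letters
        (?{ push @saved, $^N })     # save the most recent capture
        (*FAIL)                     # force a backtrack
    }x;

    my @result = @saved;            # copy before "local" is undone
    return @result;
}

my @pairs = letter_pairs("hello, world!");
say join ", ", @pairs;
```

Because `@saved` is reset on every call, calling the sub repeatedly should not accumulate pairs from earlier strings.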
5

No need "to force backtracking"!

    push @pairs, "$1$2" while /([a-zA-Z])(?=([a-zA-Z]))/g;

Though you might want to match any letter rather than the limited set you specified.

    push @pairs, "$1$2" while /(\pL)(?=(\pL))/g;
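
To see what the choice of character class changes, here is a small sketch that runs both variants on *façade*, the word mentioned in the comments on another answer. The helper name `pairs_with` is an assumption made for illustration.

```perl
use strict;
use warnings;
use utf8;   # this source contains a non-ASCII string literal

# pairs_with() is a hypothetical helper: it takes a string and a
# compiled pattern that matches exactly one "letter".
sub pairs_with {
    my ($string, $letter) = @_;
    my @pairs;
    push @pairs, "$1$2" while $string =~ /($letter)(?=($letter))/g;
    return @pairs;
}

my @ascii   = pairs_with("façade", qr/[a-zA-Z]/);
my @unicode = pairs_with("façade", qr/\pL/);

binmode STDOUT, ':encoding(UTF-8)';
print "ASCII class:   @ascii\n";     # skips every pair that involves ç
print "Unicode class: @unicode\n";   # includes aç and ça
```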
ikegami
1

Yet another way to do it. It doesn't use any regexp magic; it does use nested maps, but these could easily be translated to for loops if desired.

    #!/usr/bin/env perl

    use strict;
    use warnings;

    my $in = "hello world.";
    my @words = $in =~ /(\b\pL+\b)/g;

    my @out = map {
      my @chars = split '';
      map { $chars[$_] . $chars[$_+1] } ( 0 .. $#chars - 1 );
    } @words;

    print join ',', @out;
    print "\n";

Again, for me this is more readable than a strange regex, YMMV.
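
For reference, a for-loop translation of the nested maps might look like the sketch below. It is meant to behave the same way on the sample input; the loop variable names are my own.

```perl
use strict;
use warnings;

my $in = "hello world.";
my @out;

# Outer loop: each run of letters. Inner loop: each adjacent pair in it.
for my $word ($in =~ /(\pL+)/g) {
    my @chars = split //, $word;
    for my $i (0 .. $#chars - 1) {
        push @out, $chars[$i] . $chars[$i + 1];
    }
}

print join(',', @out), "\n";
```

A one-letter word produces an empty range (`0 .. -1`), so it contributes no pairs.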

Joel Berger
0

I would use a capturing group inside a lookahead:

    (?=([a-zA-Z]{2}))
       -------------
             |-> group 1 captures two English letters
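
In Perl, the pattern could be applied like this. It is a minimal sketch using the question's sample input; in list context, `m//g` returns every group 1 capture.

```perl
use strict;
use warnings;

my $string = "hello world.";

# The lookahead consumes nothing, so the engine steps forward one
# character after each zero-length match and overlapping pairs are found.
my @pairs = $string =~ /(?=([a-zA-Z]{2}))/g;

print join(", ", @pairs), "\n";
```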

Anirudha
  • Works nicely. Great answer. – Krishnachandra Sharma Mar 07 '13 at 19:09
  • What is an "alphabet"? Greek is an alphabet. Latin is an alphabet. Cyrillic is an alphabet. That does not capture "alphabets"; that captures two upper- or lowercase letters between A and Z inclusive. It fails on things like *façade*. – tchrist Mar 07 '13 at 19:20
  • @tchrist op never mentioned that..a-zA-Z are still alphabets – Anirudha Mar 07 '13 at 19:22
  • @Some1.Kill.The.DJ Um, that is not what *alphabet* means in English. I suggest you check a dictionary. – tchrist Mar 07 '13 at 19:23
  • @tchrist hope **English alphabets** makes it more clear to you :P – Anirudha Mar 07 '13 at 19:30
  • @Some1.Kill.The.DJ Aren't you listening? That is not what alphabet means. PLEASE see a dictionary. The word *alphabet* is not a synonym for "an alphabetic character". – tchrist Mar 07 '13 at 19:31
0

You can do this by looking for letters and using the pos function to find where the last match ended, \G to anchor another regex at that position, and substr to read a few characters from the string.

    use v5.10;
    use strict;
    use warnings;

    my $letter_re = qr/[a-zA-Z]/;

    my $string = "hello world.";
    while( $string =~ m{ ($letter_re) }gx ) {
        # Skip it if the next character isn't a letter
        # \G will match where the last m//g left off.
        # It's pos() in a regex.
        next unless $string =~ /\G $letter_re /x;

        # pos() is still where the last m//g left off.
        # Use substr to print the character before it (the one we matched)
        # and the next one, which we know to be a letter.
        say substr $string, pos($string)-1, 2;
    }

You can put the "check the next letter" logic inside the original regex with a zero-width positive look-ahead assertion, (?=pattern). Zero-width means it consumes no characters, so it is not part of the match and does not advance the position of an m//g regex. This is a bit more compact, but zero-width assertions can get tricky.

    while( $string =~ m{ ($letter_re) (?=$letter_re) }gx ) {
        # pos() is still where the last m//g left off.
        # Use substr to print the character before it (the one we matched)
        # and the next one, which we know to be a letter.
        say substr $string, pos($string)-1, 2;
    }

UPDATE: I'd originally tried to capture both the match and the look-ahead as m{ ($letter_re (?=$letter_re)) }gx but that didn't work. The look-ahead is zero-width and slips out of the match. Other answers showed that if you put a second capture inside the look-ahead, it can collapse to just...

    say "$1$2" while $string =~ m{ ($letter_re) (?=($letter_re)) }gx;
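
The difference between wrapping the capture around the look-ahead and putting a second capture inside it can be seen side by side in this sketch. The array names `@around` and `@inside` are mine, chosen for illustration.

```perl
use strict;
use warnings;

my $string = "hello world.";

# Capture wrapped around the look-ahead: only the consumed letter is
# captured, because the look-ahead itself matches zero characters.
my @around = $string =~ m{ ( [a-zA-Z] (?=[a-zA-Z]) ) }gx;

# Second capture inside the look-ahead: $2 holds the peeked-at letter.
my @inside;
push @inside, "$1$2" while $string =~ m{ ([a-zA-Z]) (?=([a-zA-Z])) }gx;

print "around: @around\n";   # single letters only
print "inside: @inside\n";   # the overlapping pairs
```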

I leave all the answers here for TMTOWTDI, especially if you're not a regex master.

Schwern
  • Do you really use `substr` and `pos`? Especially like this? I nearly never ever do. Part of it is because it is grapheme-hostile, but mostly it is because it's fiddly. – tchrist Mar 07 '13 at 19:57
  • @tchrist In this case, it seemed the simplest way to do it. `pos` and `substr` have the benefit of being straightforward to understand vs a regex with look-aheads and captures. I agree `pos` can change out from under you mysteriously, but small scopes forgive many sins. My first impulse was a non-regex iterative solution, since "match every pair of letters" is pretty simple if you loop through characters. I tried to do it with a capture + look-ahead, but couldn't make it work; I didn't realize you need to put a second capture inside the look-ahead, not around it. – Schwern Mar 07 '13 at 22:20