7

I have what I believe to be a negative lookahead assertion, <@> *(?!QQQ), which I expect to match if the tested string contains <@> followed by any number of spaces (including zero) and then not followed by QQQ.

Yet, if the tested string is <@> QQQ the regular expression matches.

I fail to see why this is the case and would appreciate any help on this matter.

Here's a test script

use warnings;
use strict;

my @strings = ('something <@> QQQ',
               'something <@> RRR',
               'something <@>QQQ' ,
               'something <@>RRR' );


print "$_\n" for map {$_ . " --> " . rep($_) } (@strings);



sub rep {

  my $string = shift;

  $string  =~ s,<@> *(?!QQQ),at w/o ,;
  $string  =~ s,<@> *QQQ,at w/  QQQ,;

  return $string;
}

This prints

something <@> QQQ --> something at w/o  QQQ
something <@> RRR --> something at w/o RRR
something <@>QQQ --> something at w/  QQQ
something <@>RRR --> something at w/o RRR

And I'd have expected the first line to be something <@> QQQ --> something at w/ QQQ.

tchrist
René Nyffenegger

3 Answers

10

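It matches because zero is included in "any number". So <@> followed by no spaces, followed in turn by a space, satisfies "any number of spaces not followed by QQQ": the text right after those zero spaces starts with a space, not with QQQ.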

You should add another lookahead assertion that the first thing after your spaces is not itself a space. Try this (untested):

 <@> *(?!QQQ)(?! )

ETA Side note: changing the quantifier to + would have helped only when there's exactly one space; in the general case, the regex can always grab one fewer space and therefore succeed. Regexes want to match, and will bend over backwards to do so in any way possible. All other considerations (leftmost, longest, etc.) take a back seat: they only determine which match is chosen when more than one is possible. Matching always wins over not matching.
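To check the claims above rather than take them on faith, here is a small sketch that runs the original pattern, the + variant and the two-lookahead variant against a few inputs. The labels star, plus and guard and the helper hits() are made up for this example.

```perl
use strict;
use warnings;

# Patterns discussed above; the labels exist only in this sketch.
my %pattern = (
    star  => qr/<@> *(?!QQQ)/,        # original: free to give back spaces
    plus  => qr/<@> +(?!QQQ)/,        # requires at least one space
    guard => qr/<@> *(?!QQQ)(?! )/,   # next character may not be a space either
);

# Hypothetical helper: returns 1 if the named pattern matches, else 0.
sub hits {
    my ($name, $string) = @_;
    return $string =~ $pattern{$name} ? 1 : 0;
}

for my $string ('<@>QQQ', '<@> QQQ', '<@>  QQQ', '<@> RRR') {
    printf "%-12s star=%d plus=%d guard=%d\n",
        $string, map { hits($_, $string) } qw(star plus guard);
}
```

Of the three, only guard rejects every input in which QQQ follows the spaces; plus rejects the one-space case but still accepts the two-space case by giving one space back.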

Mark Reed
  • `(?=\S)` should be `(?=[^ ])` (in case the next character is a tab). Actually, it should be `(?! )` (in case it's the end of the string). – ikegami Apr 27 '12 at 15:20
7
$string  =~ s,<@> *(?!QQQ),at w/o ,;
$string  =~ s,<@> *QQQ,at w/  QQQ,;

One problem of yours here is that you are viewing the two regexes separately. You first ask to replace the string without QQQ, and then to replace the string with QQQ. This is actually checking the same thing twice, in a sense. For example: if (X==0) { ... } elsif (X!=0) { ... }. In other words, the code may be better written:

unless ($string =~ s,<@> *QQQ,at w/  QQQ,) {
    # Only reached when no QQQ followed; keep the trailing space of the original replacement
    $string =~ s,<@> *,at w/o ,;
}

You always have to be careful with the * quantifier. Since it matches zero or more times, it can also match the empty string, which basically means: it can match any place in any string.

A negative look-around assertion has a similar quality, in the sense that it only needs to find a single thing that differs in order to succeed. In this case, the regex matches "<@> QQQ" as <@> + zero spaces, with the look-ahead then seeing a space, which is of course "not" QQQ. You are more or less at a logical impasse here, because the * quantifier and the negative look-ahead counteract each other.

I believe the correct way to solve this is to separate the regexes, like I showed above. There is no sense in allowing the possibility of both regexes being executed.

However, for theoretical purposes, a working regex that allows both any number of spaces and a negative look-ahead would need to be anchored, much like Mark Reed has shown. This one might be the simplest.

<@>(?! *QQQ)        # Add the spaces to the look-ahead

The difference is that now the spaces and Qs are anchored to each other, whereas before they could match separately. To drive home the point of the * quantifier, and also solve a minor problem of removing additional spaces, you can use:

<@> *(?! *QQQ)

This works because either quantifier can match the empty string: however the spaces are split between the two, the look-ahead still sees any remaining spaces followed by QQQ and rejects the match. Theoretically, you can add as many of these as you want, and it will make no difference (except in performance): / * * * * * * */ is functionally equivalent to / */. What matters here is that a run of spaces followed by QQQ may not exist after <@>.
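As a rough check, here is the rep() from the question with the spaces added to the look-ahead. It is only a sketch: the s{...}{...} delimiters are a cosmetic change I made so the slashes in the replacement need no escaping.

```perl
use strict;
use warnings;

# rep() from the question, with " *" repeated inside the look-ahead.
sub rep {
    my ($string) = @_;
    $string =~ s{<@> *(?! *QQQ)}{at w/o };    # skipped whenever spaces + QQQ follow
    $string =~ s{<@> *QQQ}{at w/  QQQ};
    return $string;
}

for my $string ('something <@> QQQ', 'something <@> RRR',
                'something <@>QQQ',  'something <@>RRR') {
    printf "%s --> %s\n", $string, rep($string);
}
```

With this change the first test string is handled by the second substitution, which is what the question expected.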

TLP
4

The regex engine will backtrack until it finds a match, or until finding a match is impossible. In this case, it found the following match:

                         +--------------- Matches "<@>".
                         |   +----------- Matches "" (empty string).
                         |   |       +--- Doesn't match " QQQ".
                         |   |       |
                        --- ----    ---
'something <@> QQQ' =~ /<@> [ ]* (?!QQQ)/x
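The match in the diagram can be inspected with Perl's @- and @+ arrays (the start and end offsets of the last successful match). A minimal sketch follows; the capture group around the spaces and the helper inspect() are additions for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (matched text, number of spaces consumed,
# text right after the match), or an empty list if nothing matches.
sub inspect {
    my ($string) = @_;
    return unless $string =~ /<@>([ ]*)(?!QQQ)/;
    my $matched = substr $string, $-[0], $+[0] - $-[0];
    my $rest    = substr $string, $+[0];
    return ($matched, length($1), $rest);
}

my ($matched, $spaces, $rest) = inspect('something <@> QQQ');
printf "matched '%s', spaces consumed: %d, what follows: '%s'\n",
    $matched, $spaces, $rest;
```

The space is left unconsumed, so the text the look-ahead inspects begins with a space rather than with QQQ.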

All you need to do is shuffle things around. Replace

/<@>[ ]*(?!QQQ)/

with

/<@>(?![ ]*QQQ)/

Or you can make it so the regex will only match all the spaces:

/<@>[ ]*+(?!QQQ)/
/<@>[ ]*(?![ ]|QQQ)/
/<@>[ ]*(?![ ])(?!QQQ)/

PS — Spaces are hard to see, so I use [ ] to make them more visible. It gets optimised away anyway.
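To see how the variants above differ in what they consume, this sketch replaces the first match with the placeholder AT. The labels and the helper replace_with() are hypothetical, and the possessive form assumes Perl 5.10 or later.

```perl
use strict;
use warnings;
use 5.010;   # possessive quantifiers such as *+ need Perl 5.10 or later

my %pattern = (
    lookahead_only => qr/<@>(?![ ]*QQQ)/,
    possessive     => qr/<@>[ ]*+(?!QQQ)/,
    alternation    => qr/<@>[ ]*(?![ ]|QQQ)/,
    two_lookaheads => qr/<@>[ ]*(?![ ])(?!QQQ)/,
);

# Hypothetical helper: replace the first match with "AT" and return the result.
sub replace_with {
    my ($name, $string) = @_;
    my $re   = $pattern{$name};
    my $copy = $string;
    $copy =~ s/$re/AT/;
    return $copy;
}

for my $string ('x <@>  RRR', 'x <@>  QQQ') {
    for my $name (sort keys %pattern) {
        printf "%-15s '%s' -> '%s'\n", $name, $string, replace_with($name, $string);
    }
}
```

All four leave the QQQ string untouched. On the RRR string, the lookahead-only form leaves the spaces in place, while the other three consume them along with <@>.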

ikegami
  • the addition of `+` fixes the match, but I can't tell why. – flies Apr 27 '12 at 15:28
  • wait, i think i've got it. `[ ]*+` ensures that all available spaces are grabbed even if it breaks the match, whereas `[ ]*` will grab as many as it can without breaking the match. – flies Apr 27 '12 at 15:36
  • @flies, Because `" " =~ / *+/` can only match `" "`. It won't backtrack to match `""`, so it can no longer find the match `/ */` does. – ikegami Apr 27 '12 at 15:39
  • `/ *+/` should mean "find zero or more spaces, one or more times", how does that work, exactly? Something about `+` being greedy and using up excess space? – TLP Apr 27 '12 at 16:57
  • @TLP, No, when `+` applies to a quantifier (e.g. `*`), it prevents backtracking through that quantifier. (Kinda like how `?` can modify the greediness of `*`.) `/ *+/` is the same thing as `/(?> *)/`. – ikegami Apr 27 '12 at 17:33
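The last two comments can be checked with a single-space string and a pattern that needs the quantifier to give that space back. This is only a sketch: the trailing [ ] after the quantifier is my addition to force the backtracking, and Perl 5.10 or later is assumed for `*+`.

```perl
use strict;
use warnings;
use 5.010;   # *+ needs Perl 5.10 or later

my $one_space = ' ';

my %pattern = (
    greedy     => qr/^[ ]*[ ]/,       # * can give the space back to the final [ ]
    possessive => qr/^[ ]*+[ ]/,      # *+ keeps the space, so the final [ ] fails
    atomic     => qr/^(?>[ ]*)[ ]/,   # same behaviour written as an atomic group
);

for my $name (qw(greedy possessive atomic)) {
    printf "%-10s %s\n", $name,
        $one_space =~ $pattern{$name} ? 'matches' : 'does not match';
}
```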