5

I am having a problem with a non-greedy regular expression (regex). I've seen that there are questions regarding non-greedy regex, but they don't answer my problem.

Problem: I am trying to match the href of the "lol" anchor.

Note: I know this can be done with Perl HTML parsing modules, and my question is not about parsing HTML in Perl. My question is about the regular expression itself and the HTML is just an example.

Test case: I have four tests for .*? and [^"]. The first two produce the expected result. However, the third doesn't, and the fourth does, but I don't understand why.

  1. Why does the third test fail in both tests for .*? and [^"]? Shouldn't the non-greedy operator work?
  2. Why does the fourth test work in both tests for .*? and [^"]? I don't understand why including a .* in front changes the regex (the third and fourth tests are the same except the .* in front).

I probably don't understand exactly how these regexes work. A Perl Cookbook recipe mentions something, but I don't think it answers my question.

use strict;

my $content=<<EOF;
<a href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a>
<a href="/foo/foo/foo/foo/foo" class="foo">foo </a>
<a href="/bar/bar/bar/bar/bar" class="bar">bar</a>
<a href="/lol/lol/lol/lol/lol" class="lol">lol</a>
<a href="/koo/koo/koo/koo/koo" class="koo">koo</a>
EOF

print "| $1 | \n\nThat's ok\n" if $content =~ m~href="(.*?)"~s ;

print "\n---------------------------------------------------\n";

print "| $1 | \n\nThat's ok\n" if $content =~ m~href="(.*?)".*>lol~s ;

print "\n---------------------------------------------------\n";

print "| $1 | \n\nWhy does not the 2nd non-greedy '?' work?\n"
  if $content =~ m~href="(.*?)".*?>lol~s ;

print "\n---------------------------------------------------\n";

print "| $1 | \n\nIt now works if I put the '.*' in the front?\n"
  if $content =~ m~.*href="(.*?)".*?>lol~s ;

print "\n###################################################\n";
print "Let's try now with [^]";
print "\n###################################################\n\n";


print "| $1 | \n\nThat's ok\n" if $content =~ m~href="([^"]+?)"~s ;

print "\n---------------------------------------------------\n";

print "| $1 | \n\nThat's ok.\n" if $content =~ m~href="([^"]+?)".*>lol~s ;

print "\n---------------------------------------------------\n";

print "| $1 | \n\nThe 2nd non-greedy still doesn't work?\n"
  if $content =~ m~href="([^"]+?)".*?>lol~s ;

print "\n---------------------------------------------------\n";

print "| $1 | \n\nNow with the '.*' in front it does.\n"
  if $content =~ m~.*href="([^"]+?)".*?>lol~s ;
Peter Mortensen
vkats
  • You state a problem and say that there is a solution that produces an expected result. I'm not sure what the question is. – musiKk May 14 '11 at 10:00
  • You are right, I wasn't precise enough. I edited and stated the question more clearly. – vkats May 14 '11 at 10:13

4 Answers

6

Try printing out $& (the text matched by the entire regex) as well as $1. This may give you a better idea of what's happening.
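As a minimal sketch of that suggestion, the snippet below runs the third pattern from the question and prints the whole match ($&), the capture ($1), and the offset where the match starts ($-[0]). The variable names $whole, $group and $start are illustrative and not taken from the question.

```perl
use strict;
use warnings;

# Same sample data as in the question.
my $content = <<'EOF';
<a href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a>
<a href="/foo/foo/foo/foo/foo" class="foo">foo </a>
<a href="/bar/bar/bar/bar/bar" class="bar">bar</a>
<a href="/lol/lol/lol/lol/lol" class="lol">lol</a>
<a href="/koo/koo/koo/koo/koo" class="koo">koo</a>
EOF

my ($whole, $group, $start);
if ($content =~ m~href="(.*?)".*?>lol~s) {
    # Copy the match variables right away; a later match would overwrite them.
    ($whole, $group, $start) = ($&, $1, $-[0]);
    print "match starts at offset $start\n";
    print "\$1 = $group\n";
    print "\$& = $whole\n";
}
```

The whole match begins at the first href= and stretches all the way to >lol, which is why $1 holds the first link rather than the "lol" one.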

The problem you seem to have is that .*? does not mean "Find the match out of all possible matches that uses the fewest characters here." It just means "First, try matching 0 characters here, and go on to match the rest of the regex. If that fails, try matching 1 character. If the rest of the regex won't match, try 2 characters here. etc."

Perl will always find the match that starts closest to the beginning of the string. Since most of your patterns start with href=, it will find the first href= in the string and see if there's any way to expand the repetitions to get a match beginning there. If it can't get a match, it'll try starting at the next href=, and so on.

When you add a greedy .* to the beginning of the regex, matching starts with the .* grabbing as many characters as it can. Perl then backtracks to find an href=. Essentially, this causes it to try the last href= in the string first, and work towards the beginning of the string.
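To see the two behaviours side by side, here is a small sketch that runs the third and fourth patterns from the question against the same data. It only illustrates the explanation above; the variable names $lazy_only and $greedy_prefix are made up for this example.

```perl
use strict;
use warnings;

# Same sample data as in the question.
my $content = <<'EOF';
<a href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a>
<a href="/foo/foo/foo/foo/foo" class="foo">foo </a>
<a href="/bar/bar/bar/bar/bar" class="bar">bar</a>
<a href="/lol/lol/lol/lol/lol" class="lol">lol</a>
<a href="/koo/koo/koo/koo/koo" class="koo">koo</a>
EOF

# No leading .*: the match starts at the first href= from which
# the rest of the pattern can be satisfied.
my ($lazy_only) = $content =~ m~href="(.*?)".*?>lol~s;

# Leading greedy .*: it swallows everything, then backs off, so the
# href= candidates are tried from the end of the string backwards.
my ($greedy_prefix) = $content =~ m~.*href="(.*?)".*?>lol~s;

print "without leading .*: $lazy_only\n";
print "with leading .*:    $greedy_prefix\n";
```

The last href= (the "koo" one) is tried first but has no >lol after it, so the engine backs off to the "lol" link, where the rest of the pattern succeeds.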

cjm
  • Thanks that seems to be the issue. It explains the first matching and the backtracking well. – vkats May 14 '11 at 10:36
  • A good thing to keep in mind is that greedy/non-greedy never changes whether the match will succeed or fail. If it succeeds greedy it will succeed non-greedy. If it fails greedy it will fail non-greedy. Greediness only comes into play when there is more than one way to match at the current position (going from left to right). In that case greedy matches the longest of the possible matches at that point, while non-greedy matches the shortest of the possible matches at that point. – tadmc May 14 '11 at 14:03
  • @cjm: Thank you, this is the first answer I have seen on the subject that actually explains why it doesn't work and how to make it work. In other questions and answers with the same problem, people just offer a different solution, not a real answer. – Francisco Zarabozo Apr 03 '13 at 10:26
0

Only the fourth test case is working.

The first: m~href="(.*?)"~s

This will match the first href within your string and capture what is between the quotes so: /hoh/hoh/hoh/hoh/hoh

The second: m~href="(.*?)".*>lol~s

This will match the first href within your string and capture what is between the quotes. Then the greedy .* matches as much as it can while still leaving a >lol to match, so $1 is again: /hoh/hoh/hoh/hoh/hoh

Try capturing the .* with m~href="(.*?)"(.*)>lol~s

$1 contains:
/hoh/hoh/hoh/hoh/hoh
$2 contains: 
class="hoh">hoh</a>
<a href="/foo/foo/foo/foo/foo" class="foo">foo </a>
<a href="/bar/bar/bar/bar/bar" class="bar">bar</a>
<a href="/lol/lol/lol/lol/lol" class="lol" 

The third: m~href="(.*?)".*?>lol~s

The same result as the previous test case.

The fourth: m~.*href="(.*?)".*?>lol~s

This will match any number of any character, then href=", then capture any number of any character non-greedily up to the next quote, and then match any number of any character until it finds >lol, so: /lol/lol/lol/lol/lol

Try capturing all the .* with m~(.*)href="(.*?)"(.*?)>lol~s

$1 contains:
<a href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a>
<a href="/foo/foo/foo/foo/foo" class="foo">foo </a>
<a href="/bar/bar/bar/bar/bar" class="bar">bar</a>
<a
$2 contains: 
/lol/lol/lol/lol/lol
$3 contains:
class="lol"

Have a look at this site; it explains what your regexes are doing.

Toto
  • Thanks for answering. You mention **what** happens (I already understand this), but not **why** it happens. Maybe my question wasn't written clearly, so I edited it. – vkats May 14 '11 at 10:15
  • @vkats: I would say because regexes work this way :-) . It's trying to match the very first occurrence of what you're searching for. – Toto May 14 '11 at 10:21
  • I know it tries to match what I told it to match. Obviously, I don't understand what I told it to match, and that's what I'm trying to figure out. – vkats May 14 '11 at 10:28
0

The main problem is that you are using non-greedy regexes when you shouldn't. The second problem is using . with *, which can accidentally match more than you intended. The s flag you are using makes . match even more, because with it . also matches newlines.

Use:

m~href="([^"]+)"[^>]*>lol~

for your case. As for non-greedy regexes, consider this code:

$_ = "xaaaaab xaaac xbbc";
m~^x.+?c~;

It would not match 'xaaac' as you might expect. It will start from the beginning of the string and match 'xaaaaab xaaac'. A greedy variant would match the whole string.
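A runnable version of that comparison might look like the sketch below. The capture parentheses and the variable names are additions for illustration, so that the matched text can be printed; the patterns are otherwise the ones described above.

```perl
use strict;
use warnings;

my $text = "xaaaaab xaaac xbbc";

# Non-greedy: anchored at the start, .+? grows until the first 'c' it can reach.
my ($lazy) = $text =~ m~^(x.+?c)~;

# Greedy: .+ runs to the end of the string and backs off to the last 'c'.
my ($greedy) = $text =~ m~^(x.+c)~;

print "non-greedy: '$lazy'\n";
print "greedy:     '$greedy'\n";
```

Neither variant can return just 'xaaac', because the ^ anchor and the leading x pin the start of the match to the first character of the string.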

The point is that though non-greedy regexes don't try to grab as much as they can, they still try to match somehow with the same eagerness as their greedy brothers. And they will grab whatever part of a string to do it.

You may also consider the "possessive" quantifiers (such as ++ and *+), which never give back what they have matched, so the engine cannot backtrack into them.
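Here is a small sketch of what possessive quantifiers do, assuming Perl 5.10 or later (where they were introduced). The one-line $html sample and the variable names are hypothetical and only serve the illustration.

```perl
use strict;
use warnings;

# a+ can give back one 'a' so that the final 'a' in the pattern matches.
my $backtracking = ('aaa' =~ /^a+a$/)  ? 1 : 0;

# a++ keeps all three characters, leaving nothing for the final 'a'.
my $possessive   = ('aaa' =~ /^a++a$/) ? 1 : 0;

print "a+a  matches 'aaa': $backtracking\n";
print "a++a matches 'aaa': $possessive\n";

# Hypothetical sample: possessive negated classes in a link pattern.
my $html = '<a href="/bar" class="bar">bar</a> <a href="/lol" class="lol">lol</a>';
my ($href) = $html =~ m~href="([^"]++)"[^>]*+>lol~;
print "href: $href\n";
```

With negated character classes the possessive form does not change which link is found here; it only stops the engine from retrying shorter variants that cannot succeed anyway.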

Also, cookbooks are a good place to start, but if you want to understand how things really work, you should read perlre.

Suor
  • Thanks for the answer (it agrees with another one given a few seconds ago :) ). I had forgotten that the match begins at the left. – vkats May 14 '11 at 10:49
0

Let me try to illustrate what's going on here (see the other answers for why it happens):

href="(.*?)"

Match: href="/hoh/hoh/hoh/hoh/hoh"
Group: /hoh/hoh/hoh/hoh/hoh

href="(.*?)".*>lol

Match: href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a> <a href="/foo/foo/foo/foo/foo" class="foo">foo </a> <a href="/bar/bar/bar/bar/bar" class="bar">bar</a> <a href="/lol/lol/lol/lol/lol" class="lol">lol

Group: /hoh/hoh/hoh/hoh/hoh

href="([^"]+?)".*?>lol

Match: href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a> <a href="/foo/foo/foo/foo/foo" class="foo">foo </a> <a href="/bar/bar/bar/bar/bar" class="bar">bar</a> <a href="/lol/lol/lol/lol/lol" class="lol">lol

Group: /hoh/hoh/hoh/hoh/hoh

.*href="(.*?)".*?>lol

Match: <a href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a> <a href="/foo/foo/foo/foo/foo" class="foo">foo </a> <a href="/bar/bar/bar/bar/bar" class="bar">bar</a> <a href="/lol/lol/lol/lol/lol" class="lol">lol

Group: /lol/lol/lol/lol/lol

One way to write the regular expression you want is to use: href="[^"]*"[^>]*>lol
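As a quick check, the sketch below applies that pattern to the data from the question. The capture parentheses around [^"]* are an addition made here so that the href can be printed; they are not part of the pattern as written above.

```perl
use strict;
use warnings;

# Same sample data as in the question.
my $content = <<'EOF';
<a href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a>
<a href="/foo/foo/foo/foo/foo" class="foo">foo </a>
<a href="/bar/bar/bar/bar/bar" class="bar">bar</a>
<a href="/lol/lol/lol/lol/lol" class="lol">lol</a>
<a href="/koo/koo/koo/koo/koo" class="koo">koo</a>
EOF

# [^"]* cannot run past the closing quote and [^>]* cannot leave the tag,
# so an attempt starting at an earlier href= fails instead of stretching
# across several links.
my ($href) = $content =~ m~href="([^"]*)"[^>]*>lol~;

print "href: $href\n";
```

Because each piece of the pattern is confined to one tag, neither the s flag nor a non-greedy quantifier is needed.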

gangabass
  • Indeed your proposition `href="[^"]*"[^>]*>lol` works. Does `href="[^"]+"[^>]+>lol` (with `+` instead of `*`) alter the meaning? – vkats May 14 '11 at 11:36
  • @vkats It works fine for me. I use `*` instead of `+` because of `href="">lol` – gangabass May 14 '11 at 11:51
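The difference between the `*` and `+` variants discussed in these comments can be sketched as follows. The three sample strings and the names $star, $plus and %result are hypothetical, chosen to cover a link with extra attributes, a link without them, and an empty href.

```perl
use strict;
use warnings;

my $star = qr~href="[^"]*"[^>]*>lol~;
my $plus = qr~href="[^"]+"[^>]+>lol~;

# Hypothetical samples covering the edge cases.
my @samples = (
    '<a href="/lol" class="lol">lol</a>',   # attributes after href
    '<a href="/lol">lol</a>',               # nothing between the quote and >
    '<a href="">lol</a>',                   # empty href
);

my %result;
for my $sample (@samples) {
    $result{$sample} = {
        star => ($sample =~ $star ? 1 : 0),
        plus => ($sample =~ $plus ? 1 : 0),
    };
    printf "%-40s star=%d plus=%d\n",
        $sample, $result{$sample}{star}, $result{$sample}{plus};
}
```

Both variants match the data in the question, but the `+` variant requires at least one character inside the quotes and at least one more before the closing >, so it rejects the last two samples.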