I am having a problem with a non-greedy regular expression (regex). I've seen that there are questions regarding non-greedy regex, but they don't answer my problem.
Problem: I am trying to match the href of the "lol" anchor.
Note: I know this can be done with Perl HTML parsing modules, and my question is not about parsing HTML in Perl. My question is about the regular expression itself and the HTML is just an example.
Test case: I have four tests each for .*? and [^"]. The first two produce the expected result. However, the third doesn't, and the fourth does, but I don't understand why.
- Why does the third test fail in both sets of tests, for .*? and for [^"]? Shouldn't the non-greedy operator work?
- Why does the fourth test work in both sets of tests, for .*? and for [^"]? I don't understand why including a .* in front changes the regex (the third and fourth tests are the same except for the .* in front).
I probably don't understand exactly how these regexes work. A Perl Cookbook recipe mentions something, but I don't think it answers my question.
use strict;
use warnings;
my $content=<<EOF;
<a href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a>
<a href="/foo/foo/foo/foo/foo" class="foo">foo </a>
<a href="/bar/bar/bar/bar/bar" class="bar">bar</a>
<a href="/lol/lol/lol/lol/lol" class="lol">lol</a>
<a href="/koo/koo/koo/koo/koo" class="koo">koo</a>
EOF
print "| $1 | \n\nThat's ok\n" if $content =~ m~href="(.*?)"~s ;
print "\n---------------------------------------------------\n";
print "| $1 | \n\nThat's ok\n" if $content =~ m~href="(.*?)".*>lol~s ;
print "\n---------------------------------------------------\n";
print "| $1 | \n\nWhy doesn't the 2nd non-greedy '?' work?\n"
    if $content =~ m~href="(.*?)".*?>lol~s ;
print "\n---------------------------------------------------\n";
print "| $1 | \n\nIt now works if I put the '.*' in the front?\n"
if $content =~ m~.*href="(.*?)".*?>lol~s ;
print "\n###################################################\n";
print "Let's try now with [^\"]";
print "\n###################################################\n\n";
print "| $1 | \n\nThat's ok\n" if $content =~ m~href="([^"]+?)"~s ;
print "\n---------------------------------------------------\n";
print "| $1 | \n\nThat's ok.\n" if $content =~ m~href="([^"]+?)".*>lol~s ;
print "\n---------------------------------------------------\n";
print "| $1 | \n\nThe 2nd non-greedy still doesn't work?\n"
    if $content =~ m~href="([^"]+?)".*?>lol~s ;
print "\n---------------------------------------------------\n";
print "| $1 | \n\nNow with the '.*' in front it does.\n"
if $content =~ m~.*href="([^"]+?)".*?>lol~s ;
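To make the difference between the third and fourth tests easier to inspect, here is a minimal sketch that reports where each overall match starts, in addition to what it captures. It relies on Perl's built-in `@-` array, where `$-[0]` is the offset of the start of the last successful match. The helper name `match_info` and the test labels are hypothetical, added only for illustration; the patterns and the sample markup are the ones from the script above.

```perl
use strict;
use warnings;

# Same sample markup as in the tests above.
my $content = <<'EOF';
<a href="/hoh/hoh/hoh/hoh/hoh" class="hoh">hoh</a>
<a href="/foo/foo/foo/foo/foo" class="foo">foo </a>
<a href="/bar/bar/bar/bar/bar" class="bar">bar</a>
<a href="/lol/lol/lol/lol/lol" class="lol">lol</a>
<a href="/koo/koo/koo/koo/koo" class="koo">koo</a>
EOF

# Hypothetical helper (not part of the original tests): returns the offset
# at which the overall match starts and the captured href, or an empty list
# if the pattern does not match.
sub match_info {
    my ($text, $re) = @_;
    return unless $text =~ $re;
    return ($-[0], $1);
}

# The third and fourth patterns from the .*? group of tests.
my @tests = (
    [ 'third test'  => qr~href="(.*?)".*?>lol~s   ],
    [ 'fourth test' => qr~.*href="(.*?)".*?>lol~s ],
);

for my $test (@tests) {
    my ($label, $re) = @$test;
    my ($start, $href) = match_info($content, $re);
    if (defined $start) {
        printf "%-11s match starts at offset %d, captured %s\n",
            $label, $start, $href;
    }
    else {
        print "$label: no match\n";
    }
}
```

Comparing the reported start offset with the position of the "lol" anchor in `$content` shows whether the overall match began at the intended tag or earlier in the string.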