What Perl regex can replace selected text that is not part of an anchor tag? For example, I would like to replace only the last "text" in the following code.

blah <a href="http://www.text.com"> blah text blah </a> blah text blah.

Thanks.

brian d foy
zylstra
  • gulp. Regex and html. goes to hide... – Sam Holder Jan 25 '10 at 10:12
  • Aren't the first and last two "blahs" also "not part of an anchor tag?" – Jay Jan 25 '10 at 10:12
  • @Jay - I assume the OP wants to `magic_replace(html, 'text', 'link still ok')` – Kobi Jan 25 '10 at 10:19
  • @Jay: Presumably he's doing `s/text/replacement/g`, so the blahs don't match. But this is not a job for a regex (alone). – cjm Jan 25 '10 at 10:21
  • Ah... got it. Yes, refer to the seminal text on the subject: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Jay Jan 25 '10 at 10:29
  • It is said that in Ulthar, which lies beyond the river Skai, no man may parse html with a regex. – daotoad Jan 25 '10 at 18:31

3 Answers

You don't want to try to parse HTML with a regex. Try HTML::TreeBuilder instead.

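use strict;
use warnings;
use HTML::TreeBuilder;

my $html = HTML::TreeBuilder->new_from_file('file.html');
# or some other method, depending on where your HTML is

doReplace($html);
print $html->as_HTML;   # emit the modified document

# Recursively replace "text" in every text node that is not inside an <a> element.
sub doReplace
{
  my $elt = shift;

  foreach my $node ($elt->content_refs_list) {
    if (ref $$node) {
      # element node: recurse unless it is an anchor
      doReplace($$node) unless $$node->tag eq 'a';
    } else {
      $$node =~ s/text/replacement/g;
    } # end else this is a text node
  } # end foreach $node

} # end doReplace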
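
To try the tree walk on the sample line from the question without a file, here is a minimal sketch that builds the tree from a string with `new_from_content`. The helper name `replace_in_tree` and its `$from`/`$to` parameters are illustrative assumptions, not part of the answer; the traversal itself is the same one shown above.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical variant of doReplace that takes the search string and the
# replacement as arguments instead of hard-coding them.
sub replace_in_tree {
    my ($elt, $from, $to) = @_;
    foreach my $node ($elt->content_refs_list) {
        if (ref $$node) {
            # Element node: descend into it unless it is an anchor.
            replace_in_tree($$node, $from, $to) unless $$node->tag eq 'a';
        }
        else {
            # Text node: \Q...\E makes $from match literally.
            $$node =~ s/\Q$from\E/$to/g;
        }
    }
}

my $sample = 'blah <a href="http://www.text.com"> blah text blah </a> blah text blah.';
my $tree   = HTML::TreeBuilder->new_from_content($sample);
replace_in_tree($tree, 'text', 'replacement');

my $result = $tree->as_HTML;
print "$result\n";
$tree->delete;    # release the tree's memory
```

Because HTML::TreeBuilder builds a whole document, the printed result is wrapped in the implicit html, head and body elements rather than being the bare fragment. The "text" in the href and inside the link should be untouched, while the last "text" becomes "replacement".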
cjm

I have temporarily prevailed:

$html =~ s|(text)([^<>]*?<)(?!\/a>)|replacement$2|is;

but I was dispirited, dismayed, and enervated by the seminal text; and so shall pursue HTML::TreeBuilder in subsequent endeavors.

zylstra
  • Use of regex html parsers will cause you to wind up like Charles Dexter Ward. – daotoad Jan 25 '10 at 18:28
  • Your regex will also replace "text" that sits inside another tag nested within the link, because it only looks at the first end tag. – cjm Jan 25 '10 at 19:41
  • It depends on what you're parsing. If the input is small, regular lines of HTML output by another process, for example, then a regex might be appropriate. If it is actual full HTML pages, then a proper HTML parser makes sense... – plusplus Jan 26 '10 at 11:01
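
The failure mode raised in the comments can be checked with core Perl alone. The sketch below wraps the substitution from this answer in a hypothetical helper, `regex_replace`, and runs it on two made-up inputs. The trailing `<br>` is an assumption added so the pattern has the `<` it requires after the match.

```perl
use strict;
use warnings;

# The substitution from the answer, wrapped in a hypothetical helper so it
# can be tried on several inputs.
sub regex_replace {
    my ($html) = @_;
    $html =~ s|(text)([^<>]*?<)(?!\/a>)|replacement$2|is;
    return $html;
}

# Case 1: the link text sits directly inside <a>.
my $plain = regex_replace('<a href="#"> blah text blah </a> blah text blah.<br>');

# Case 2: the link text is wrapped in <b>, so the first end tag after
# "text" is </b>, not </a>, and the lookahead does not reject it.
my $nested = regex_replace('<a href="#"><b>text</b></a> blah text blah.<br>');

print "$plain\n";    # <a href="#"> blah text blah </a> blah replacement blah.<br>
print "$nested\n";   # <a href="#"><b>replacement</b></a> blah text blah.<br>
```

In the second case the link text is rewritten, and since the substitution has no `/g` flag, the "text" outside the link is left alone.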

Don't use regexps for this kind of thing. Use a proper HTML parser to find the parts of the HTML you're interested in, and apply a plain regexp only to those parts.
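
One way to read this advice is sketched below, assuming the event-driven HTML::Parser module is acceptable. The parser separates tags from text, and the regex only ever sees text found outside a link. The helper name `replace_in_stream` and its parameters are hypothetical, and this is an illustration rather than the answerer's own code.

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper: replace $from with $to in text outside <a>...</a>,
# copying all markup through exactly as it appeared in the input.
sub replace_in_stream {
    my ($html, $from, $to) = @_;
    my $out     = '';
    my $in_link = 0;    # number of currently open <a> elements

    my $parser = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag, $text) = @_;
            $in_link++ if $tag eq 'a';
            $out .= $text;
        }, 'tagname, text' ],
        end_h => [ sub {
            my ($tag, $text) = @_;
            $in_link-- if $tag eq 'a' && $in_link > 0;
            $out .= $text;
        }, 'tagname, text' ],
        text_h => [ sub {
            my ($text) = @_;
            # The plain regex runs only on text outside a link.
            $text =~ s/\Q$from\E/$to/g unless $in_link;
            $out .= $text;
        }, 'text' ],
        # Comments, declarations and so on are passed through unchanged.
        default_h => [ sub { $out .= shift }, 'text' ],
    );
    $parser->unbroken_text(1);    # deliver each text run in one piece
    $parser->parse($html);
    $parser->eof;
    return $out;
}

my $sample = 'blah <a href="http://www.text.com"> blah text blah </a> blah text blah.';
my $fixed  = replace_in_stream($sample, 'text', 'replacement');
print "$fixed\n";
# blah <a href="http://www.text.com"> blah text blah </a> blah replacement blah.
```

Unlike the tree-based approach, this leaves the original markup byte-for-byte intact. One caveat: the contents of script and style elements also arrive as text events, so a real page may need an extra check for those tags.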