How can I replace text that is not part of an anchor tag in Perl?

Question

What is a Perl regex that can replace selected text that is not part of an anchor tag? For example, I would like to replace only the last "text" in the following code.

blah <a href="http://www.text.com"> blah text blah </a> blah text blah.

Thanks.

Aren't the first and last two "blahs" also "not part of an anchor tag?" — Jay, Jan 25 '10 at 10:12
@Jay - I assume the OP wants to `magic_replace(html, 'text', 'link still ok')` — Kobi, Jan 25 '10 at 10:19
@Jay: Presumably he's doing `s/text/replacement/g`, so the blahs don't match. But this is not a job for a regex (alone). — cjm, Jan 25 '10 at 10:21
Ah... got it. Yes, refer to the seminal text on the subject: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — Jay, Jan 25 '10 at 10:29
It is said that in Ulthar, which lies beyond the river Skai, no man may parse html with a regex. — daotoad, Jan 25 '10 at 18:31

score 8 · Answer 1 · answered Jan 25 '10 at 10:33

You don't want to try to parse HTML with a regex. Try HTML::TreeBuilder instead.

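use strict;
use warnings;
use HTML::TreeBuilder;

my $html = HTML::TreeBuilder->new_from_file('file.html');
# or some other method, depending on where your HTML is

doReplace($html);

# Walk the tree and replace text everywhere except inside <a> elements.
sub doReplace
{
  my $elt = shift;

  # content_refs_list returns a reference to each child,
  # so text nodes can be modified in place
  foreach my $node ($elt->content_refs_list) {
    if (ref $$node) {
      # child element: recurse unless it is an anchor
      doReplace($$node) unless $$node->tag eq 'a';
    } else {
      $$node =~ s/text/replacement/g;
    } # end else this is a text node
  } # end foreach $node

} # end doReplace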
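
To try the same idea on the sample line from the question without a file, something like the following should work. It is a minimal sketch, assuming HTML::TreeBuilder is installed from CPAN. It repeats the doReplace routine so the block runs on its own; the variable names ($input, $tree, $body, $out) are only illustrative.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Sample line from the question
my $input = 'blah <a href="http://www.text.com"> blah text blah </a> blah text blah.';

# Parse from a string instead of a file
my $tree = HTML::TreeBuilder->new_from_content($input);
doReplace($tree);

# The parser wraps the fragment in <html>, <head> and <body>,
# so serialize only the <body> element.
my $body = $tree->look_down(_tag => 'body');
my $out  = $body->as_HTML;
print "$out\n";

$tree->delete;    # release the tree when done

# Same routine as in the answer: skip <a> elements, edit text nodes in place.
sub doReplace {
    my $elt = shift;
    foreach my $node ($elt->content_refs_list) {
        if (ref $$node) {
            doReplace($$node) unless $$node->tag eq 'a';
        } else {
            $$node =~ s/text/replacement/g;
        }
    }
}
```

The parser may normalize whitespace, so the serialized output is not guaranteed to match the input byte for byte apart from the replacement.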

score 1 · Accepted Answer · edited May 23 '17 at 12:18

I have temporarily prevailed:

$html =~ s|(text)([^<>]*?<)(?!\/a>)|replacement$2|is;

but I was dispirited, dismayed, and enervated by the seminal text, and so shall pursue HTML::TreeBuilder in subsequent endeavors.

answered Jan 25 '10 at 10:55 by zylstra · edited May 23 '17 at 12:18 by Community

Use of regex html parsers will cause you to wind up like Charles Dexter Ward. – daotoad Jan 25 '10 at 18:28
Your regex will also replace the "text" inside `text`, because it only looks at the first end tag. – cjm Jan 25 '10 at 19:41
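
A quick way to see the behaviour described in this comment is to run the substitution on two inputs. This is only an illustrative sketch: answer_regex is a hypothetical wrapper around the answer's substitution, and the nested input is made up for the demonstration.

```perl
use strict;
use warnings;

# Hypothetical wrapper around the substitution from the answer above.
sub answer_regex {
    my ($html) = @_;
    $html =~ s|(text)([^<>]*?<)(?!\/a>)|replacement$2|is;
    return $html;
}

# Made-up input: "text" sits inside <b>, which sits inside the anchor.
# The next tag after "text" is </b>, not </a>, so the lookahead lets it through.
my $nested = answer_regex('<a href="http://example.com"><b>text</b></a>');
print "$nested\n";    # prints: <a href="http://example.com"><b>replacement</b></a>

# The question's sample line with nothing after the final "blah.":
# no "<" follows the last "text", so the pattern has nothing to match.
my $sample = answer_regex('blah <a href="http://www.text.com"> blah text blah </a> blah text blah.');
print "$sample\n";    # prints the line unchanged
```

The pattern only inspects the first tag that follows the word, so it can both replace text inside a link and miss text that is not followed by any tag.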
It depends on what you're parsing: if it is small, regular lines of HTML output by another process, for example, then a regex might be appropriate. If it is actual full HTML pages, then a proper HTML parser makes sense... – plusplus Jan 26 '10 at 11:01
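
For the small, regular case mentioned here, one possible regex-only approach is to split the line into anchors, other tags, and plain text, and substitute only in the plain text. This is a rough sketch under stated assumptions, not a general HTML parser: replace_outside_anchors is a hypothetical helper, and it assumes anchors are not nested, attribute values contain no ">", and the text contains no stray "<".

```perl
use strict;
use warnings;

# Hypothetical helper: replace $from with $to only in text that is outside
# <a>...</a> elements and outside any tag.
sub replace_outside_anchors {
    my ($html, $from, $to) = @_;

    # The capturing group makes split keep the delimiters, so @parts
    # alternates between plain text and whole anchors or single tags.
    my @parts = split m{(<a\b[^>]*>.*?</a\s*>|<[^>]*>)}is, $html;

    for my $part (@parts) {
        next if $part =~ /^</;         # an anchor element or another tag: leave alone
        $part =~ s/\Q$from\E/$to/g;    # plain text outside anchors
    }
    return join '', @parts;
}

my $line   = 'blah <a href="http://www.text.com"> blah text blah </a> blah text blah.';
my $result = replace_outside_anchors($line, 'text', 'replacement');
print "$result\n";
# prints: blah <a href="http://www.text.com"> blah text blah </a> blah replacement blah.
```

Anything that breaks those assumptions (full pages, comments, scripts, malformed markup) is better handled with a real parser such as HTML::TreeBuilder.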

score 0 · Answer 3 · answered Jan 25 '10 at 10:24

Don't use regexps for this kind of thing. Use a proper HTML parser, and apply a plain regexp only to the parts of the HTML that you're interested in.
