The documentation on CPAN doesn't really explain this behavior unless I'm missing something. I've put together some quick test code to illustrate my problem:

#!/usr/bin/perl
use warnings;
use strict;

use HTML::TreeBuilder;

my $testHtml = " 
<body>
        <h1>
                <p> 
                        <p>HELLO!
                        </p> 
                </p> 
        </h1>
</body>";

my $parsedPage = HTML::TreeBuilder->new;
$parsedPage->parse($testHtml);
$parsedPage->eof();

my @p = $parsedPage->look_down('_tag' => 'p');

foreach my $p (@p) {
    print $p->parent->tag, " : ", $p->tag, "\t", $p->as_text, "\n";
}

After running the above script, the output is:

body : p

body : p        HELLO! 

Since the tags are nested one inside another, I would expect the parent of the first p tag to be h1 and the parent of the second p tag to be the first p. Why does the parent method report body for both?

cjm
s2cuts

1 Answer

Your HTML is invalid: a p element cannot be nested inside an h1 or inside another p. And given that HTML::TreeBuilder is a subclass of HTML::Parser, I can only assume that the parser is doing what it can to transform your document into valid HTML.

You can call $parsedPage->as_HTML to see what the parser has done to your HTML. It gives me this:

<html><head></head><body><h1></h1><p><p>HELLO! </body></html>
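
If the one-line as_HTML output is hard to read, the dump method that HTML::TreeBuilder inherits from HTML::Element prints one element per line, indented by depth, which makes each element's parent easier to see. A minimal sketch, assuming the same sample markup as in the question (the variable names here are only for illustration):

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

# Same nesting as the question's sample: <p> inside <p> inside <h1>.
my $html = <<'END_HTML';
<body>
        <h1>
                <p>
                        <p>HELLO!
                        </p>
                </p>
        </h1>
</body>
END_HTML

my $tree = HTML::TreeBuilder->new;
$tree->parse($html);
$tree->eof;

# dump() writes an indented outline of the tree to STDOUT.
$tree->dump;

# The same information collected programmatically:
# the tag name of the parent of every <p> in the tree.
my @parents = map { $_->parent->tag } $tree->look_down(_tag => 'p');
print "parents of p: ", join(', ', @parents), "\n";

$tree->delete;    # release the tree when done
```

Comparing the dump against the original markup shows where the tree was restructured.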

Perhaps you should pass your HTML through a validator or HTML::Tidy, before processing it.

Dave Cross
  • hmmm, the sample is only a sort of recreation of some HTML I need to parse. I'm not sure what the best way of dealing with invalid HTML would be... – s2cuts Jan 31 '11 at 11:52
  • Actually, HTML::Parser doesn't know or care what tags are allowed to nest inside each other. All it does is recognize start tags, end tags, text, etc. It's HTML::TreeBuilder that takes the events generated by HTML::Parser and constructs a validly nested tree. It tries to deal with invalid HTML in the same way most browsers do. – cjm Sep 28 '12 at 20:00
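
One way to see the restructuring cjm describes is to toggle HTML::TreeBuilder's implicit_tags attribute. According to the module's documentation, setting it to false gives a tree that just reflects the text as it stands, and the author warns that this is unlikely to be useful beyond quick and dirty parsing. The sketch below is only a comparison aid, not a recommended fix; p_parents is a hypothetical helper written for this example, and the exact output may vary with the installed version.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::TreeBuilder;

my $html = '<body><h1><p><p>HELLO!</p></p></h1></body>';

# Hypothetical helper: parse $html with implicit_tags switched on or off
# and return a "parent > p" string for every <p> element found.
sub p_parents {
    my ($markup, $implicit) = @_;

    my $tree = HTML::TreeBuilder->new;
    $tree->implicit_tags($implicit);    # set before parsing begins
    $tree->parse($markup);
    $tree->eof;

    my @pairs = map { $_->parent->tag . ' > ' . $_->tag }
                $tree->look_down(_tag => 'p');

    $tree->delete;                      # release the tree
    return @pairs;
}

print "implicit_tags on:  ", join(', ', p_parents($html, 1)), "\n";
print "implicit_tags off: ", join(', ', p_parents($html, 0)), "\n";
```

With the default (on), the result should match the output in the question. With it off, the parents should follow the nesting as written in the source, which is what the question expected to see.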