
I was trying to parse some webpage's content using HTML::TreeBuilder and then do a manual XPath-like walk.

But I got something really weird.

This is the XPath produced from the web page by Chrome's Developer Tools:

/html/body/table/tbody/tr/td[1]/table[3]/tbody/tr[1]/td[2]/table[1]/tbody/tr[1]/td[2]/table[9]

That last inner table #9 is what I need - more specifically, a cell that has "click to view" text in it.

Here's the developer tools screenshot - notice that BODY tag only has one table under it:

[Screenshot: Chrome Developer Tools, BODY element with a single TABLE child]

And if you drill down into that XPath you will see the element I seek (notice it's really a table nested within a table within a table; I included the TD element I seek):

[Screenshot: Chrome Developer Tools, nested tables expanded down to the TD element]

However, this is what HTML::TreeBuilder produced instead: a <body> tag with 22 child elements, most of which are <table> tags:

  DB<16>  x $tree->tag
0  'body'

  DB<17>  x map {$_->tag} $tree->content_list
0  'table'
1  'table'
2  'table'
3  'table'
4  'table'
5  'table'
6  'table'
7  'table'
8  'table'
9  'table'
10  'table'
11  'table'
12  'table'
13  'table'
14  'table'
15  'table'
16  'table'
17  'table'
18  'table'
19  'script'
20  'table'
21  'table'

And as you can see, the 9th table directly under the BODY tag (index 8 in the listing above) contains the element I want:

  DB<37> foreach my $c (0 .. $tree->content_list-1) { 
           if (($tree->content_list)[$c]->as_HTML =~ /click to view/)
              {print $c+1}}
9
DVK
  • The code to produce the tree is `my $tree = HTML::TreeBuilder->new_from_content($html);` in case it matters – DVK Nov 24 '13 at 00:57
  • By the way why are you not using HTML::TreeBuilder::XPath? – gangabass Nov 24 '13 at 01:33
  • @gangabass - (1) my version of Strawberry Perl doesn't include it and (2) It took me a whole 10 mins to write the code to do XPath-like searching in the tree, so I didn't bother figuring out how to install it – DVK Nov 24 '13 at 02:31
  • Remove stuff from the HTML until you have the shortest example that demonstrates the problem, and show us that (unless it becomes obvious what the problem is at that point) – ysth Nov 24 '13 at 03:53
  • @ysth - I'll try, but it's probably a better investment of time for me to just write a full tree traversal that finds the node I need in the tree ANYPLACE it is. That HTML isn't exactly... neat. My main question was, is there a known bug or constructor argument I need to know about that'd cause such behavior - seems that nobody knows of a specific one. I'll wait for cjm to chime in if he logs in, before spending 2 hours on ungnarling the HTML – DVK Nov 24 '13 at 15:45

1 Answer


It's most likely that the page you're processing contains invalid HTML. In that situation there is no single correct tree to build from it, so it's open season on how the content should be interpreted, and different software (Chrome and HTML::TreeBuilder, in your case) will make different choices.

I'm afraid there isn't much you can do about it apart from either processing the HTML without the help of a parser, or perhaps finding the error and fixing it before you put it through HTML::TreeBuilder. Neither of these is a very pleasant prospect.
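
If the exact position of the node can't be relied on, one workaround is to search for it by content instead of by path. The following is only a minimal sketch, not a fix for the underlying parse difference: the `$html` sample is a made-up, well-formed stand-in for your page, and I'm assuming the cell you want is the innermost `<td>` whose text contains "click to view". It uses `look_down`, `as_text` and `lineage` from HTML::Element.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical stand-in for the real page: tables nested inside a table cell.
my $html = <<'HTML';
<html><body>
<table><tr><td>outer
  <table><tr><td>first inner</td></tr></table>
  <table><tr><td>click to view</td></tr></table>
</td></tr></table>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Find every <td> whose text matches, wherever it ended up in the tree.
# as_text includes descendant text, so enclosing cells match as well;
# the second filter drops any cell that itself contains a nested table.
my @cells = $tree->look_down(
    _tag => 'td',
    sub { $_[0]->as_text =~ /click to view/ },
    sub { !$_[0]->look_down(_tag => 'table') },
);

for my $cell (@cells) {
    # lineage() lists ancestors from the parent up to the root, so reverse it
    # to show where HTML::TreeBuilder actually placed the cell.
    my $path = join '/', map { $_->tag } reverse($cell->lineage), $cell;
    print "$path: ", $cell->as_text, "\n";
}
```

Printing the path this way also shows where the parser really put the element, which may help narrow down which part of the markup it interpreted differently from Chrome. Call `$tree->delete` once you are done with the tree.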

Borodin