How to parse html with HTML::TreeBuilder?

Question

This is the HTML I'd like to parse:

[...]
<div class="item" style="clear:left;">
 <div class="icon" style="background-image:url(http://nwn2db.com/assets/builder/icons/40x40/is_acidsplash.png);">
 </div>
  <h2>Acid Splash</h2>
   <p>Caster Level(s): Wizard / Sorcerer 0
   <br />Innate Level: 0
   <br />School: Conjuration
   <br />Descriptor(s): Acid
   <br />Component(s): Verbal, Somatic
   <br />Range: Medium
   <br />Area of Effect / Target: Single
   <br />Duration: Instant
   <br />Save: None
   <br />Spell Resistance: Yes
   <p>
   You fire a small orb of acid at the target for 1d3 points of acid damage.
 </div>
[...]

This is my algorithm:

my $text = '';

scan_child($spells);

print $text, "\n";

sub scan_child {
  my $element = $_[0];
  return if ($element->tag eq 'script' or
             $element->tag eq 'a');   # prune!
  foreach my $child ($element->content_list) {
    if (ref $child) {  # it's an element
      scan_child($child);  # recurse!
    } else {           # it's a text node!
      $child =~ s/(.*)\:/\\item \[$1\]/; #itemize
      $text .= $child;
      $text .= "\n";
    }
  }
  return;
}

It picks up the <key> : <value> pattern and prunes garbage like <script> or <a>...</a>. I'd like to improve it so that it also captures the <h2>...</h2> header and each whole <p>...</p> block, so I can add some LaTeX tags.

Any clue?

Thanks in advance.

Perhaps you should take a step back and work out what information you want to extract from the page(s) you're scraping, and how you want to store it. If you have a certain schema or data structure in mind, it would be helpful to add it to the question. If you're just looking to extract all the text, you're already well on your way there. — i alarmed alien, Sep 25 '14 at 20:58
Maybe. It's still not clear to me what HTML::TreeBuilder stores in its nodes. — Daniele, Sep 25 '14 at 21:39

score 0 · Answer 1 · answered Sep 25 '14 at 20:58

Because this may be an XY Problem...

Mojo::DOM is a somewhat more modern framework for parsing HTML using CSS selectors. The following pulls the P element that you want from the document:

use strict;
use warnings;

use Mojo::DOM;

my $dom = Mojo::DOM->new(do {local $/; <DATA>});

for my $h2 ($dom->find('h2')->each) {
    next unless $h2->all_text eq 'Acid Splash';

    # Get following P
    my $next_p = $h2;
    while ($next_p = $next_p->next_sibling()) {
        last if $next_p->node eq 'tag' and $next_p->type eq 'p';
    }

    print $next_p;
}

__DATA__
<html>
<body>
<div class="item" style="clear:left;">
 <div class="icon" style="background-image:url(http://nwn2db.com/assets/builder/icons/40x40/is_acidsplash.png);">
 </div>
  <h2>Acid Splash</h2>
   <p>Caster Level(s): Wizard / Sorcerer 0
   <br />Innate Level: 0
   <br />School: Conjuration
   <br />Descriptor(s): Acid
   <br />Component(s): Verbal, Somatic
   <br />Range: Medium
   <br />Area of Effect / Target: Single
   <br />Duration: Instant
   <br />Save: None
   <br />Spell Resistance: Yes
   <p>
   You fire a small orb of acid at the target for 1d3 points of acid damage.
 </div>
 </body>
 </html>

Outputs:

<p>Caster Level(s): Wizard / Sorcerer 0
   <br>Innate Level: 0
   <br>School: Conjuration
   <br>Descriptor(s): Acid
   <br>Component(s): Verbal, Somatic
   <br>Range: Medium
   <br>Area of Effect / Target: Single
   <br>Duration: Instant
   <br>Save: None
   <br>Spell Resistance: Yes
   </p>
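
The method names used above match the Mojolicious release that was current in 2014. To my knowledge, later releases renamed them: `node` became `type`, `type` became `tag`, and `next_sibling` became `next_node`. Here is a minimal sketch of the same lookup against the newer API, assuming a recent Mojolicious. The helper name `spell_fields` and the abbreviated sample markup are illustrative only, not part of the original answer.

```perl
use strict;
use warnings;
use Mojo::DOM;

# Abbreviated copy of the sample markup from the question.
my $html = <<'HTML';
<div class="item" style="clear:left;">
  <h2>Acid Splash</h2>
   <p>Caster Level(s): Wizard / Sorcerer 0
   <br />Innate Level: 0
   <br />School: Conjuration
   <p>
   You fire a small orb of acid at the target for 1d3 points of acid damage.
</div>
HTML

# Hypothetical helper: collect the "key: value" lines from the first <p>
# that follows the <h2> whose text equals $name.
sub spell_fields {
    my ($html, $name) = @_;
    my $dom = Mojo::DOM->new($html);
    my %fields;
    for my $h2 ($dom->find('h2')->each) {
        next unless $h2->all_text eq $name;

        # First sibling <p> element after the heading
        my $p = $h2->following('p')->first or next;

        for my $node ($p->child_nodes->each) {
            next unless $node->type eq 'text';    # skip the <br> elements
            my ($key, $value) = $node->content =~ /^\s*(.+?)\s*:\s*(.*?)\s*$/s
                or next;
            $fields{$key} = $value;
        }
    }
    return \%fields;
}

my $fields = spell_fields($html, 'Acid Splash');
printf "%s => %s\n", $_, $fields->{$_} for sort keys %$fields;
```

Returning a hash keeps the extraction separate from whatever output format you want to generate later.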

score 0 · Answer 2 · answered Sep 25 '14 at 21:03

I use the look_down() method to scan HTML. Using look_down(), I can first get a list of all the divs with class="item".

Then I can iterate over them, and find and process the h2 and the p, which I would then split using <br /> as my splitter.
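
The steps above could be sketched roughly as follows. In an HTML::TreeBuilder tree, elements are HTML::Element objects and text nodes are plain strings, so "splitting on the br" amounts to walking content_list() and keeping the non-reference children. The helper names (`trim`, `item_to_latex`, `html_to_latex`), the LaTeX layout, and the assumption that the first p holds the key/value lines while any later p holds the description are mine, not from the question. The text is not escaped for LaTeX.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Abbreviated copy of the sample markup from the question.
my $html = <<'HTML';
<div class="item" style="clear:left;">
  <h2>Acid Splash</h2>
   <p>Caster Level(s): Wizard / Sorcerer 0
   <br />Innate Level: 0
   <br />School: Conjuration
   <p>
   You fire a small orb of acid at the target for 1d3 points of acid damage.
</div>
HTML

# Strip leading and trailing whitespace.
sub trim { my ($s) = @_; $s =~ s/^\s+|\s+$//g; return $s }

# Hypothetical helper: turn one <div class="item"> into a LaTeX fragment.
sub item_to_latex {
    my ($item) = @_;

    # Scalar context: look_down() returns the first match (or undef)
    my $h2 = $item->look_down(_tag => 'h2') or return '';
    # List context: look_down() returns every match
    my ($fields, @rest) = $item->look_down(_tag => 'p');

    my $out = '\subsection*{' . trim($h2->as_text) . "}\n";
    return $out unless $fields;

    $out .= "\\begin{description}\n";
    for my $child ($fields->content_list) {
        next if ref $child;    # an element such as <br>; keep only text nodes
        my ($key, $value) = split /\s*:\s*/, trim($child), 2;
        next unless defined $value;
        $out .= "\\item[$key] $value\n";
    }
    $out .= "\\end{description}\n";

    # Assumption: any remaining <p> is free-text description
    $out .= trim($_->as_text) . "\n" for @rest;
    return $out;
}

# Hypothetical wrapper: convert every <div class="item"> in a document.
sub html_to_latex {
    my ($html) = @_;
    my $tree = HTML::TreeBuilder->new_from_content($html);
    my $out  = join "\n",
        map { item_to_latex($_) } $tree->look_down(_tag => 'div', class => 'item');
    $tree->delete;    # release the tree
    return $out;
}

print html_to_latex($html);
```

Note that look_down(class => 'item') matches the attribute value exactly, so a div with several classes would need a regex or a code-reference criterion instead.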


Len Jaffe
