Web crawler text formatting

Question

I have the following code to access an HTML table.

my $table = $tree->look_down(_tag => "table", id => "moduleDetail");

However, the text comes back unformatted, because the web page uses the table's borders to separate pieces of text. The output looks something like "mathematics for computingJordanstown", with "Jordanstown" presumably coming from the next cell. Here is the code I am using:

my @array;
my $tree  = HTML::TreeBuilder->new_from_content($mech->content);
my $table = $tree->look_down(_tag => "table", id => "moduleDetail");

for ($table->look_down(_tag => 'tr')) {
    push(@array, $_->as_text());
}

foreach (@array) {
    print $_, " ";
}
$tree->delete();

Note: I tried to separate the text using an array, but had no luck. Any pointers? Thanks.

Can you show us some input text? – brian d foy Apr 06 '12 at 21:22

score 1 · Answer 1 · answered Apr 06 '12 at 23:49

Accessing text nodes of the HTML tree is made much easier if you call the objectify_text method on the tree. This changes the text nodes from simple strings to instances of HTML::Element with a pseudo tag name of ~text and an attribute called text equal to the text string. This allows the look_down method to search for text nodes.

If you recode like this you will get the value of each separate text node pushed onto the array.

my $tree = HTML::TreeBuilder->new_from_content($mech->content);
$tree->objectify_text;    # turn text nodes into '~text' elements

my $table = $tree->look_down(_tag => "table", id => "moduleDetail");

my @text;

# Each match is a text node, not a table row
for my $node ($table->look_down(_tag => '~text')) {
  my $text = $node->attr('text');
  push @text, $text if $text =~ /\S/;    # skip whitespace-only nodes
}

print "$_\n" for @text;
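To try this without the live page, here is a minimal self-contained sketch. The HTML is hypothetical markup invented for illustration (the real page was never posted), so treat the cell contents and the nested small tag as assumptions.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical markup standing in for the real page (an assumption).
my $html = <<'HTML';
<table id="moduleDetail">
  <tr><td>Mathematics for Computing</td><td>Jordanstown</td></tr>
  <tr><td>Semester <small>1</small></td><td>20 credits</td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);
$tree->objectify_text;    # text nodes become elements tagged '~text'

my $table = $tree->look_down(_tag => 'table', id => 'moduleDetail')
    or die "table not found\n";

my @text;
for my $node ($table->look_down(_tag => '~text')) {
    my $text = $node->attr('text');
    push @text, $text if $text =~ /\S/;    # drop whitespace-only nodes
}

print "$_\n" for @text;    # one line per text node
$tree->delete;
```

Note that text inside a nested tag such as small arrives as its own entry, so a cell with inner markup contributes more than one element to @text.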

Brant Olsen · Accepted Answer · 2013-07-11T18:41:52.150

score 0

Using HTML::TreeBuilder::XPath

I suggest using the Perl module HTML::TreeBuilder::XPath for this. It should give you exactly what you want.

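From the documentation, I believe your code would look like this using the XPath module:

my $tree = HTML::TreeBuilder::XPath->new_from_content($mech->content);
# findnodes_as_strings returns one string per matching node;
# findnodes_as_string (singular) returns a single string instead.
my @trArray = $tree->findnodes_as_strings('//table[@id="moduleDetail"]/tr/td');
$tree->delete();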

For more information on XPath see http://www.w3schools.com/xpath/.
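As a rough illustration of the XPath approach, the sketch below runs against hypothetical markup (an assumption, since the real page was not shown). It uses findnodes_as_strings, which returns one string per matching node, and matches td at any depth under the table so that it does not depend on whether the rows are wrapped in a tbody.

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;

# Hypothetical markup standing in for the real page (an assumption).
my $html = <<'HTML';
<table id="moduleDetail">
  <tr><td>Mathematics for Computing</td><td>Jordanstown</td></tr>
  <tr><td>Semester <small>1</small></td><td>20 credits</td></tr>
</table>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# One list element per td; tags nested inside a cell do not split it.
my @cells = $tree->findnodes_as_strings('//table[@id="moduleDetail"]//td');

print join(' ', @cells), "\n";    # cells separated by a space
$tree->delete;
```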

Using HTML::TreeBuilder

If you want to stick with HTML::TreeBuilder, you will need to do the following:

my @array;
my $tree  = HTML::TreeBuilder->new_from_content($mech->content);
my $table = $tree->look_down(_tag => "table", id => "moduleDetail");

# Collect the text of each cell rather than each whole row
for ($table->look_down(_tag => 'td')) {
  push(@array, $_->as_text());
}
$tree->delete();
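If the row boundaries matter, one possible variation is to collect the cells of each row and join them with a separator. This is only a sketch: the markup and the ' | ' separator are assumptions chosen for illustration, and header cells (th) are not collected.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Hypothetical markup standing in for the real page (an assumption).
my $html = <<'HTML';
<table id="moduleDetail">
  <tr><td>Mathematics for Computing</td><td>Jordanstown</td></tr>
  <tr><td>Semester <small>1</small></td><td>20 credits</td></tr>
</table>
HTML

my $tree  = HTML::TreeBuilder->new_from_content($html);
my $table = $tree->look_down(_tag => 'table', id => 'moduleDetail')
    or die "table not found\n";

my @rows;
for my $tr ($table->look_down(_tag => 'tr')) {
    # as_text flattens any tags nested inside a cell into plain text
    my @cells = map { $_->as_text } $tr->look_down(_tag => 'td');
    push @rows, join(' | ', @cells);
}

print "$_\n" for @rows;    # one line per table row
$tree->delete;
```

With this layout, text from adjacent cells can no longer run together, and each printed line corresponds to one tr.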

edited Jul 11 '13 at 18:41

answered Apr 06 '12 at 19:32

Brant Olsen


What is XPath, and how can I use it? – aspiringCoder Apr 06 '12 at 19:35
Follow the link I provided. You install it just like HTML::TreeBuilder and use it in a similar way. – Brant Olsen Apr 06 '12 at 19:40
OK, I do see what this is doing, but I need to look in the parent tag and stop it from accessing the child. For example, a tr could have a tag inside it called small? – aspiringCoder Apr 06 '12 at 19:46
@StefanReaney I updated the example to find nodes as string. That way any extra tags inside should be ignored. – Brant Olsen Apr 06 '12 at 19:48
@StefanReaney If you still need better example code, please add some example HTML to your question. – Brant Olsen Apr 06 '12 at 19:59
OK, all I want to do is get all the information from a table with a specific ID. Because there are no spaces in the text between different cells, it isn't formatted very well, meaning that two words can be stuck together. All I want is to read that information and make sure that doesn't occur. – aspiringCoder Apr 06 '12 at 20:03
@StefanReaney Updated with HTML::TreeBuilder code that I believe gives you what you want. If not, let me know. – Brant Olsen Apr 06 '12 at 20:27
Ahh, I amended that with a " ", then pushed the data to the array, and it worked :O – aspiringCoder Apr 06 '12 at 20:58

There is no need to find all `tr` nodes and then all `td` nodes within those. A simple `push @array, $_->as_text for $table->look_down(_tag => 'td')` will do the trick. – Borodin Apr 06 '12 at 23:52
