How to extract a column of a table from html page using perl modules?

Question

I have the following html code of a part of a webpage.

<h2 id="failed_process">Failed Process</h2>
<table border="1">
  <thead>
    <tr>
      <th>
        <b>pid</b>
      </th>
      <th>
        <b>Priority</b>
      </th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td id="90"><a href="details.jsp?pid=p_201211162334&refresh=0">p_201211162334</a></td>
      <td id="priority_90">NORMAL</td>
    </tr>
    <tr>
      <td id="91"><a href="details.jsp?pid=p_201211163423&refresh=0">p_201211163423</a></td>
      <td id="priority_91">NORMAL</td>
    </tr>
    <tr>
      <td id="98"><a href="details.jsp?pid=p_201211166543&refresh=0">p_201211166543</a></td>
      <td id="priority_98">NORMAL</td>
    </tr>
  </tbody>
</table>
<hr>

I need to extract the pid column. The output should look like this:

pid
p_201211162334
p_201211163423
p_201211166543

The "Failed Process" table is the 4th table on the page. The problem is that if I select it by table count (4) and the page has no failed tasks, the extractor moves on to the next table and fetches that table's pids, which gives wrong results.

I am using the code below to get the result.

#!/usr/bin/perl
 use strict; 
 use warnings;
 use lib qw(..);
 use HTML::TableExtract;

 my $content = get("URL");
 my $te = HTML::TableExtract->new(
 headers => [qw(pid)], attribs => { id => 'failed_process' },
 );

 $te->parse($content);

 foreach my $col ($te->rows) {
 print ("\t", @$col), "\n";
 }

But I am getting the following error:

Can't call method "rows" on an undefined value

the `id => 'failed_process'` is not the id of the table, but the header — orhanhenrik, Jan 07 '13 at 07:47
Yes!! How do i extract pid referring to that id of the header? — UKR, Jan 07 '13 at 07:54
Maybe you need to get the table first and then get its rows? Please look at the usage examples for the module: http://search.cpan.org/~msisk/HTML-TableExtract-2.11/lib/HTML/TableExtract.pm – Kostia Shiian, Jan 07 '13 at 08:07
KostiaShiian : I looked into the above mentioned link many times. It doesn't have the required details.. !! — UKR, Jan 07 '13 at 08:16
Check out [Mojo::UserAgent](http://search.cpan.org/~sri/Mojolicious-3.73/lib/Mojo/UserAgent.pm). It can be done quite easily with this module. — Joakim, Jan 07 '13 at 09:20
This would be much simpler to answer if we could see the full HTML page. Are you saying that the table is omitted altogether from the page if there are no failed processes? Is the subsequent table (the one after the failed processes table) always present? You have omitted the `<table>` element from your example; is it just a bare `<table>` or does it have any attributes? – Borodin, Jan 07 '13 at 10:41
I'm sorry, I didn't notice the opening `<table>` tag because it was after the `<h2>` element and on the same line. Please ignore that part of my comment. I have reformatted the HTML in your question for better clarity. – Borodin, Jan 07 '13 at 10:48
Yes. The whole table is omitted from the page if there are no failed processes, and yes, the subsequent table is always present. I did include the element in the code. The table just has the h2 tag with its id, and no other attributes. – UKR, Jan 07 '13 at 10:52

score 1 · Answer 1 · answered Jan 07 '13 at 13:10

With my favourite DOM parser, Mojo::DOM from the Mojolicious suite, it would look like this:

#!/usr/bin/env perl

use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# instantiate with all DATA lines
my $dom = Mojo::DOM->new(do { local $/; <DATA> });

# extract all first column cells
$dom->find('table tr')->each(sub {
    my $cell = shift->children->[0];
    say $cell->all_text;
});

__DATA__
<h2 id="failed_process">Failed Process</h2>
<table border="1">
    ...

Output:

pid
p_201211162334
p_201211163423
p_201211166543
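
The snippet above reads every table on the page. As a rough sketch of how the lookup might be tied to the heading instead of a table count, the adjacent-sibling selector `h2#failed_process + table` can be used so that nothing is returned when no table directly follows the heading. The helper name `failed_pids` and the shortened sample HTML are made up for illustration, and the sketch assumes the table, when present, is the element immediately after the `<h2>`.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use Mojo::DOM;

# Hypothetical helper: returns the text of the first cell of every row
# in the table that directly follows <h2 id="failed_process">, or an
# empty list when no table follows that heading.
sub failed_pids {
    my ($html) = @_;
    my $table = Mojo::DOM->new($html)->at('h2#failed_process + table')
        or return;
    my @cells;
    for my $row ($table->find('tr')->each) {
        my $cell = $row->children->first or next;
        my $text = $cell->all_text;
        $text =~ s/^\s+|\s+$//g;    # all_text may keep surrounding whitespace
        push @cells, $text;
    }
    return @cells;
}

# Shortened sample data (assumption: same layout as in the question)
my $html = <<'HTML';
<h2 id="failed_process">Failed Process</h2>
<table border="1">
  <thead>
    <tr><th><b>pid</b></th><th><b>Priority</b></th></tr>
  </thead>
  <tbody>
    <tr><td id="90"><a href="details.jsp?pid=p_201211162334&refresh=0">p_201211162334</a></td><td>NORMAL</td></tr>
    <tr><td id="91"><a href="details.jsp?pid=p_201211163423&refresh=0">p_201211163423</a></td><td>NORMAL</td></tr>
  </tbody>
</table>
<hr>
HTML

print "$_\n" for failed_pids($html);
```

If the heading is missing, or something other than a table follows it, `at` returns undef and the helper returns an empty list rather than falling through to a later table.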

score 0 · Answer 2 · edited Jan 07 '13 at 13:13

After `$te->parse($html)`, you can loop over the matched tables with something like `foreach my $table ($te->tables) { ... }` and then get each table's rows with `$table->rows`. You can also dump `$te` with Data::Dumper to analyze what was matched.
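
A minimal sketch of that loop, assuming HTML::TableExtract is installed and that matching on the `pid` header is enough to find the table. The helper name `pid_column` and the shortened sample HTML are invented for illustration. Iterating over `$te->tables` simply does nothing when no table matched, which avoids calling `rows` on an undefined value.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical helper: collects the values of the "pid" column from
# every table whose header row contains "pid".
sub pid_column {
    my ($html) = @_;
    my $te = HTML::TableExtract->new(headers => ['pid']);
    $te->parse($html);
    $te->eof;    # tell the underlying parser the input is complete
    my @pids;
    for my $table ($te->tables) {
        for my $row ($table->rows) {
            my $pid = $row->[0];
            next unless defined $pid;
            $pid =~ s/^\s+|\s+$//g;    # trim stray whitespace
            push @pids, $pid;
        }
    }
    return @pids;
}

# Shortened sample data (assumption: same layout as in the question)
my $sample = <<'HTML';
<h2 id="failed_process">Failed Process</h2>
<table border="1">
  <thead>
    <tr><th><b>pid</b></th><th><b>Priority</b></th></tr>
  </thead>
  <tbody>
    <tr><td><a href="details.jsp?pid=p_201211162334">p_201211162334</a></td><td>NORMAL</td></tr>
    <tr><td><a href="details.jsp?pid=p_201211163423">p_201211163423</a></td><td>NORMAL</td></tr>
  </tbody>
</table>
HTML

print "$_\n" for pid_column($sample);
```

Note that this matches any table with a `pid` header, so on its own it does not tell the "Failed Process" table apart from a later table that uses the same header.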

edited Jan 07 '13 at 13:13 by memowe · answered Jan 07 '13 at 10:18 by Kostia Shiian

Hmmm..I Tried with the above changes.. Its not Working.. !! – UKR Jan 07 '13 at 10:32
Did you analyze output of Data::Dumper($te)? – Kostia Shiian Jan 07 '13 at 10:36
yes i did.. this is the small snippet of the o/p $VAR1 = bless( {'_ts_sequential' => [], 'headers' => ['pid' ], 'br_translate' => 1, 'gridmap' => 1, 'attribs' => { 'id' => 'failed_process' }, '_counts' => [ 5, 3 ], '_tablestack' => [] }, 'HTML::TableExtract' ); – UKR Jan 07 '13 at 11:03
This code works: use strict; use warnings; use HTML::TableExtract; use Data::Dumper; my $content; open (F,'<' , $ARGV[0]); while (<F>) { $content.= $_; } my $te = HTML::TableExtract->new( headers => [qw(pid)] ); $te->parse($content); my @table = $te->tables; foreach my $row ($table[0]->rows) { print join(',', @$row), "\n"; } – Kostia Shiian Jan 07 '13 at 18:35
Thanks Shiian!! But when I use this code directly with the URL, it doesn't work!! – UKR Jan 08 '13 at 03:59
$content must contain HTML code. You need to check that. And I don't understand what `get` is. Do you use LWP::Simple? – Kostia Shiian Jan 08 '13 at 06:29
