
I am very new to Perl programming and am badly stuck. I have to parse an HTML file containing a single table, and I have to extract the row in which one column entry is known to me.

My HTML file looks like this:

many previous rows description in html format....

<td>some_value_default</td>
<td>0x0</td>
<td><a href="something" target="xyz">something</a></td>
<td>abcd</td>

//*

<tr><a name="Maximum_Capacity"></a>

<td>some 23:4</td>
<td>some_27: 15</td>
<td>24:29</td>
<td>17</td>
<td colspan=3>Maximum_Capacity</td>
<td colspan=5>
some commonly use value are:  24:31|25:67|677:89|xyz abc    
</td>
//*

<td>some_value_default</td>
<td> 0x0</td>
<td><a href="something.html" target="ren">sometext</a></td>
<td>again some text</td>

description of many rows in html afterwards...

The lines between the //* markers are the row I want to fetch, because I want to use the information it contains. How can I fetch that row into an array so that each column entry is stored as an array element?

Please help me with this.

zdim
rikki

1 Answer

Use HTML::TableExtract to process tables in an HTML document. It's an excellent tool.

A very basic example:

use warnings;
use strict;
use feature 'say';

use List::MoreUtils qw(none);
use HTML::TableExtract;

my $file = shift @ARGV;
die "Usage: $0 html-file\n" if not $file or not -f $file;

my $html = do {  # read the whole file into $html string
    local $/;
    open my $fh, '<', $file or die "Can't open $file: $!";
    <$fh>;
};

my $te = HTML::TableExtract->new;
$te->parse($html);

# Print all tables in this html page
foreach my $ts ($te->tables) {
   say "Table (", join(',', $ts->coords), "):";
   foreach my $row ($ts->rows) {
      say "\t", join ',', grep { defined } @$row;
   }
}

# Assume that the table of interest is the second one
# (use index 0 if the page has a single table, as in the question)
my $table = ($te->tables)[1];
foreach my $row ($table->rows) {
    # Select the row you need; for example, identify distinct text in a cell
    next if none { defined and /Maximum_Capacity/ } @$row;
    say "\t", join ',', grep { defined } @$row;
}

The module provides many ways to set up parsing preferences, specify tables, retrieve elements, use headers, etc. Please see the documentation and search this site for related posts.

I used none from List::MoreUtils to test if no elements of a list satisfy a condition.
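As a minimal sketch of how that none filter behaves, the snippet below applies it to hand-written rows instead of a parsed table. Everything in it is an assumption for illustration: the sample rows, the helper name rows_matching, and the use of none from the core List::Util module (which I believe provides it from version 1.33 on) in place of List::MoreUtils. Each returned row is an array reference whose elements are the defined cells.

```perl
use strict;
use warnings;
use feature 'say';
use List::Util qw(none);   # core module; none() assumed available (List::Util 1.33+)

# Hypothetical rows, shaped like the array refs that $table->rows yields;
# cells without content are undef.
my @rows = (
    [ 'some_value_default', '0x0', undef, 'abcd' ],
    [ 'some 23:4', 'some_27: 15', '24:29', '17', 'Maximum_Capacity' ],
);

# Return the rows in which at least one defined cell matches $pattern.
# (rows_matching is an illustrative helper, not part of any module.)
sub rows_matching {
    my ($pattern, @candidates) = @_;
    my @found;
    for my $row (@candidates) {
        # none { ... } is true when the block is false for every element
        next if none { defined($_) and /$pattern/ } @$row;
        push @found, [ grep { defined } @$row ];   # keep only defined cells
    }
    return @found;
}

my ($wanted) = rows_matching(qr/Maximum_Capacity/, @rows);
say join ',', @$wanted;   # each column entry is one element of @$wanted
```

With real data, the same helper could be called on the list returned by $table->rows.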

Also see this post and this post, with different processing details, and search for more.

zdim
  • Actually, the problem is that I am working on a company server that does not have the TableExtract package installed, and requesting installation of a package will take at least 2 weeks. I can't wait until then, which is why I want an alternative – rikki Aug 16 '18 at 09:30
  • I think I'd prefer `next if none { defined() and /Maximum_Capacity/ } @$row` or even just `next if none { $_ and /Maximum_Capacity/ } @$row` – Borodin Aug 16 '18 at 15:23
  • @rikki The alternatives aren't pleasant. There are general X|HTML parsers (that come installed -- I think?) but using them on tables is _much_ harder and will incur a lot more work. Another "option" is to parse the html text "by hand" but that's a picky and tricky affair with no good end, and I'd strongly advise not to go there. – zdim Aug 16 '18 at 18:13
  • @rikki Why not install the module as a user? You need absolutely no permissions for that. Then you can develop your code, probably use it in production that way -- and once the package is installed by admins you have the code ready. – zdim Aug 16 '18 at 18:14
  • @Borodin Absolutely -- what I have processes the list twice; I got lulled by the earlier use of filtering. Fixed (and I keep `defined`), thank you. – zdim Aug 16 '18 at 18:20
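
Regarding the suggestion to install the module as a user: a rough sketch of how a script might pick up a per-user install and check whether a module loads. The directory below is only an assumption (it is the default that local::lib uses, as far as I know) and the helper module_available is hypothetical; adjust the path to wherever the module actually ends up.

```perl
use strict;
use warnings;
use feature 'say';

# Hypothetical per-user install location; the real path depends on
# how and where the module was installed.
use lib "$ENV{HOME}/perl5/lib/perl5";

# Return 1 if the named module can be loaded from the current @INC, else 0.
# (module_available is an illustrative helper, not a library function.)
sub module_available {
    my ($module) = @_;
    (my $path = "$module.pm") =~ s{::}{/}g;   # HTML::TableExtract -> HTML/TableExtract.pm
    return eval { require $path; 1 } ? 1 : 0;
}

say 'HTML::TableExtract is ',
    module_available('HTML::TableExtract') ? 'available' : 'not installed';
```

Once the admins install the package system-wide, the use lib line can simply be dropped.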