Trying to use HTML::TableExtract in Perl to extract table from HTML file, but failing

Question

I am trying to extract information for each G protein-coupled receptor from the tables on a page such as the following:

http://www.iuphar-db.org/DATABASE/ObjectDisplayForward?objectId=1&familyId=1

More specifically, I want to pull the data from the columns Ligand, Sp., Action, Affinity, and Units. So far my extraction only produces empty output files, so it seems the module is not recognizing the table I am specifying. Here is the code I have written so far; it is meant to go through the HTML file for each G protein-coupled receptor.

use warnings;
use strict;
use HTML::TableExtract;

my @names = `ls /home/wallakin/LINDA/ligands/iuphar/data/html`;

foreach (@names)
{
#Delete empty lines in HTML
open (IN, "</home/wallakin/LINDA/ligands/iuphar/data/html/$_") or die "Can't open html";
my @htmllines = <IN>;
close IN;
for (@htmllines)
{
    s/^\s*$// or s/^\s*//;
}
open (OUT, ">/home/wallakin/LINDA/ligands/iuphar/data/html2/$_");
print OUT @htmllines;
close OUT;

#Extract data from HTML tables based on column headers
my $te = HTML::TableExtract->new ( 
                    headers => [ qw(Ligand Sp. Action Affinity Units) ],
                    depth => 1,
                    count => 1


                    );


$te->parse_file("/home/wallakin/LINDA/ligands/iuphar/data/html2/$_");

my $output = $_;
$output =~ s/\.html/\.txt/g;
open (RESET, ">/home/wallakin/LINDA/ligands/iuphar/data/ligands/$output");
close RESET;
open (DATA, ">>/home/wallakin/LINDA/ligands/iuphar/data/ligands/$output");
binmode (DATA, ":utf8");
binmode (STDOUT, ":utf8");  


foreach my $ts ($te->tables)
{
    print "Table (", join(',', $ts->coords), "):\n";


    foreach my $row ($te->rows)
    {

        foreach ( grep {defined} @$row)
        {
            $_ =~ s/\n/\ /g;
            $_ =~ s/\r//g;  
            #$_ =~ s/\s+/ /g;
        }

        #Each column's data separated by tabs
        print DATA join ("\t", grep {defined} @$row),"\n";
    }
}
close DATA;
}

I wrote a previous program (that worked, thankfully) that downloads the HTML file for each G protein-coupled receptor, and I have been passing those files to this program. I'm not sure whether I used the right headers, depth, or count.

I apologize if this post sounds stupid in any way, but I am new to bioinformatics and to programming in general. Thanks for any help!

Is one of the HTML files in /home/wallakin/LINDA/ligands/iuphar/data/html2/ just the raw HTML from the URL you provided? Your code is straightforward and should be easy to debug, assuming others (like me) can reproduce the input data, and assuming HTML::TableExtract works, which I don't know for sure yet either. – jimtut, Aug 12 '13 at 18:55
Thanks for the reply! Yes, it's the raw HTML file edited with regular expression substitutions to remove empty lines. — Wally, Aug 12 '13 at 19:01

score 1 · Accepted Answer · answered Aug 12 '13 at 23:13

This seems to work with the URL you provided:

use 5.014;
use strict;
use warnings;
use open qw(:std :utf8);

use HTML::TableExtract;

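# Select any table whose header row matches all of these column names.
# No depth or count is given, so the table is found wherever it is nested.
my $te = HTML::TableExtract->new(
    headers => [qw(Ligand Sp. Action Affinity Units Reference)],
);

$te->parse_file('sample.html');

my @tables = $te->tables;
for my $t (@tables) {
    my @rows = $t->rows;
    for my $r (@rows) {
        for my $c (@$r) {
            # Trim leading and trailing whitespace from each cell.
            $c =~ s/\A\s+//;
            $c =~ s/\s+\z//;
        }
        # Print the cells of one row separated by single spaces.
        say "@$r";
    }
}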
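
The question asked for tab-separated output, so here is a minimal sketch of the same loop adapted for that. It is not part of the original answer. The inline HTML and its cell values are made-up placeholders, and it uses only `parse`, `eof`, `tables`, and `rows` from HTML::TableExtract. Undefined cells are turned into empty strings so that the columns stay aligned.

```perl
use strict;
use warnings;
use HTML::TableExtract;

# Hypothetical miniature of the ligand table; the values are placeholders.
my $html = <<'HTML';
<table>
  <tr><th>Ligand</th><th>Sp.</th><th>Action</th><th>Affinity</th><th>Units</th></tr>
  <tr><td> ligand-A </td><td>Hs</td><td>agonist</td><td>8.0</td><td>pKi</td></tr>
  <tr><td>ligand-B</td><td>Hs</td><td>antagonist</td><td>7.5</td><td>pKi</td></tr>
</table>
HTML

my $te = HTML::TableExtract->new(
    headers => [qw(Ligand Sp. Action Affinity Units)],
);
$te->parse($html);   # parse a string instead of a file
$te->eof;            # tell the parser the document is complete

my @lines;
for my $table ($te->tables) {
    for my $row ($table->rows) {
        my @cells = map {
            my $c = defined $_ ? $_ : '';   # keep a slot for missing cells
            $c =~ s/\A\s+|\s+\z//g;         # trim both ends
            $c =~ s/\s+/ /g;                # collapse newlines and repeated spaces
            $c;
        } @$row;
        push @lines, join("\t", @cells);
    }
}

print "$_\n" for @lines;
```

Replacing undefined cells rather than dropping them with `grep {defined}` keeps every row at the same number of columns, which matters if the output is later read as a table.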

Sinan Ünür

I tried this and got this error: "String found where operator expected at extract_iuphar_dbdata.pl line 50, near "say "@$r"" (Do you need to predeclare say?)" – Wally Aug 13 '13 at 14:02
He's using Perl 5.14. If you dropped the "use 5.014;" line because you have an older Perl version, just replace "say" with "print" (and add a "\n", since print does not append a newline). – jimtut Aug 13 '13 at 14:08
All things aside, when I try to use a print command on @$row, I get this error: "Global symbol "$row" requires explicit package name at extract_iuphar_dbdata.pl line 49." – Wally Aug 13 '13 at 14:10
Problem solved. It turns out that I forgot to chomp my list of filenames, which caused my parser not to recognize the filenames I fed through it. Otherwise, your code worked like a charm! Thanks a bunch, y'all! – Wally Aug 13 '13 at 14:49
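
For anyone hitting the same problem, here is a minimal sketch of the filename issue described in the last comment. It is not from the original thread: the temporary directory and the two file names are hypothetical stand-ins for the html/ directory in the question, and it assumes a Unix-like system with `ls` available. Each element returned by backticks still ends in a newline, so a path built from it does not match any file on disk until it is chomped.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical stand-in for the html/ directory from the question.
my $dir = tempdir(CLEANUP => 1);
for my $name (qw(receptor1.html receptor2.html)) {
    open my $fh, '>', "$dir/$name" or die "Can't create $dir/$name: $!";
    close $fh;
}

# Backticks return one element per line, each still ending in "\n".
my @raw = `ls $dir`;

# chomp removes that trailing newline from every element of the copy.
chomp(my @names = @raw);

# With the newline attached the path does not exist; after chomp it does.
my $raw_found   = grep { -e "$dir/$_" } @raw;
my $clean_found = grep { -e "$dir/$_" } @names;
print "found before chomp: $raw_found, after chomp: $clean_found\n";

# Listing with glob avoids the shell and the newline problem altogether.
my @paths = glob("$dir/*.html");
print "$_\n" for @paths;
```

In the question's script, the equivalent change would be to chomp `@names` right after the `ls` call, or to build the list with `glob` or `opendir`/`readdir` instead of shelling out.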
