Whenever I run the following Perl script, I get the errors below:

Use of uninitialized value $date in concatenation (.) or string at D:\sagar\toc\Online_TOC.pl line 111, <> line 1.
Use of uninitialized value $first_page in concatenation (.) or string at D:\sagar\toc\Online_TOC.pl line 111, <> line 1.
Use of uninitialized value $last_page in concatenation (.) or string at D:\sagar\toc\Online_TOC.pl line 111, <> line 1. 

The code is run at the command prompt, and I enter this URL when asked:

http://ajpheart.physiology.org/content/309/11

It generates the meta_issue_11.xml file, but the output is not correct.

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

use HTML::Parser;
use WWW::Mechanize;

my ( $date, $first_page, $last_page, @toc );

sub get_date {
    my ( $self, $tag, $attr ) = @_;

    if ( 'span' eq $tag
        and $attr->{class}
        and 'highwire-cite-metadata-date' eq $attr->{class}
        and not defined $date )
    {

        $self->handler( text => \&next_text_to_date, 'self, text' );

    }
    elsif ( 'span' eq $tag
        and $attr->{class}
        and 'highwire-cite-metadata-pages' eq $attr->{class} )
    {
        if ( not defined $first_page ) {
            $self->handler( text => \&parse_first_page, 'self, text' );
        }
        else {
            $self->handler( text => \&parse_last_page, 'self, text' );
        }

    }
    elsif ( 'span' eq $tag
        and $attr->{class}
        and 'highwire-cite-metadata-doi' eq $attr->{class} )
    {
        $self->handler( text => \&retrieve_doi, 'self, text' );

    }
    elsif ( 'div' eq $tag
        and $attr->{class}
        and $attr->{class} =~ /\bissue-toc-section\b/ )
    {
        $self->handler( text => \&next_text_to_toc, 'self, text' );
    }
}

sub next_text_to_date {
    my ( $self, $text ) = @_;

    $text =~ s/^\s+|\s+$//g;
    $date = $text;
    $self->handler( text => undef );
}

sub parse_first_page {
    my ( $self, $text ) = @_;

    if ( $text =~ /([A-Z0-9]+)(?:-[0-9A-Z]+)?/ ) {
        $first_page = $1;
        $self->handler( text => undef );
    }
}

sub parse_last_page {
    my ( $self, $text ) = @_;

    if ( $text =~ /(?:[A-Z0-9]+-)?([0-9A-Z]+)/ ) {
        $last_page = $1;
        $self->handler( text => undef );
    }
}

sub next_text_to_toc {
    my ( $self, $text ) = @_;

    push @toc, [$text];
    $self->handler( text => undef );
}

sub retrieve_doi {
    my ( $self, $text ) = @_;

    if ( 'DOI:' ne $text ) {
        $text =~ s/^\s+|\s+$//g;
        push @{ $toc[-1] }, $text;
        $self->handler( text => undef );
    }
}

print STDERR 'Enter the URL: ';
chomp( my $url = <> );
my ( $volume, $issue ) = ( split m(/), $url )[ -2, -1 ];

my $p = 'HTML::Parser'->new(
    api_version => 3,
    start_h     => [ \&get_date, 'self, tagname, attr' ],
);

my $mech = 'WWW::Mechanize'->new( agent => 'Mozilla' );
$mech->get( $url );
my $contents = $mech->content;
$p->parse( $contents );
$p->eof;

my $toc;
for my $section ( @toc ) {
    $toc .= "<TocSection>\n";
    $toc .= "<Heading>" . shift( @$section ) . "</Heading>\n";
    $toc .= join q(), map "<DOI>$_</DOI>\n", @$section;
    $toc .= "</TocSection>\n";
}

open( F6, ">meta_issue_$issue.xml" );

print F6 <<"__HTML__";
<!DOCTYPE MetaIssue SYSTEM "http://schema.highwire.org/public/toc/MetaIssue.pubids.dtd">
<MetaIssue volume="$volume" issue="$issue">
<Provider>Cadmus</Provider>
<IssueDate>$date</IssueDate>
<PageRange>$first_page-$last_page</PageRange>
<TOC>$toc</TOC>
</MetaIssue>
__HTML__
Ashraf Bashir
Sagar

1 Answer


The primary problem is that you're comparing the whole class attribute string for equality, whereas the class you are looking for may be just one of several space-separated class names in that attribute.
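To illustrate the difference, here is a minimal sketch. The attribute value is made up for the example and `has_class` is a hypothetical helper, not something taken from the page or from HTML::Parser:

```perl
use strict;
use warnings;

# Hypothetical helper: true if $wanted is one of the space-separated
# class names in a class attribute value.
sub has_class {
    my ( $class_attr, $wanted ) = @_;

    return 0 unless defined $class_attr;

    my %names = map { $_ => 1 } split ' ', $class_attr;
    return $names{$wanted} ? 1 : 0;
}

# Made-up attribute value that carries two class names
my $class_attr = 'highwire-cite-metadata-date highwire-cite-metadata';

# String equality fails as soon as there is more than one class name
my $by_equality = $class_attr eq 'highwire-cite-metadata-date' ? 1 : 0;

# Looking up the individual name still succeeds
my $by_lookup = has_class( $class_attr, 'highwire-cite-metadata-date' );

print "equality test: $by_equality\n";    # prints "equality test: 0"
print "class lookup:  $by_lookup\n";      # prints "class lookup:  1"
```

When the equality test fails, the text handler is never installed, so the variable it was meant to fill stays undefined.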

There are also a number of other issues, such as using WWW::Mechanize just to fetch a web page when LWP::Simple will do fine, and testing 'span' eq $tag three separate times.

Here's a working version. I would prefer to see XML::Writer used to create the output XML, but I have kept to simple print statements, as in your own code.

Note that comments like #/ are there only to persuade the Stack Overflow syntax highlighter to colour the text correctly. You should remove them from the live code.

#!/usr/bin/perl
use strict;
use warnings 'all';

use LWP::Simple 'get';
use HTML::Parser;

my ( $date, $first_page, $last_page, @toc );

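print 'Enter the URL: ';
my $url = <>;
$url = '' unless defined $url;    # end of input without a line
chomp $url;

# Fall back to the default only after the newline has been removed,
# otherwise pressing Enter leaves a true value ("\n") in $url
$url ||= 'http://ajpheart.physiology.org/content/309/11';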

my ( $volume, $issue ) = ( split m(/), $url )[ -2, -1 ];  #/

my $p = 'HTML::Parser'->new(
    api_version => 3,
    start_h     => [ \&get_span_div, 'self, tagname, attr' ],
);

# LWP::Simple's get returns undef if the request fails
my $contents = get($url) or die "Unable to fetch $url\n";
$p->parse( $contents );
$p->eof;

my $toc = '';
for my $section ( @toc ) {
    $toc .= "\n";
    $toc .= "    <TocSection>\n";
    $toc .= "      <Heading>" . shift( @$section ) . "</Heading>\n";
    $toc .= "      <DOI>$_</DOI>\n" for @$section;
    $toc .= "    </TocSection>";
}

open my $out_fh, '>', "meta_issue_$issue.xml" or die $!;

print  { $out_fh } <<"__HTML__";
<!DOCTYPE MetaIssue SYSTEM "http://schema.highwire.org/public/toc/MetaIssue.pubids.dtd">
<MetaIssue volume="$volume" issue="$issue">
  <Provider>Cadmus</Provider>
  <IssueDate>$date</IssueDate>
  <PageRange>$first_page-$last_page</PageRange>
  <TOC>$toc
  </TOC>
</MetaIssue>
__HTML__
#/

sub get_span_div {
    my ( $self, $tag, $attr ) = @_;

    my $class = $attr->{class};
    my %class;
    %class = map { $_ => 1 } split ' ', $class if $class;

    if ( $tag eq 'span' ) {

        if ( $class{'highwire-cite-metadata-date'} ) {

            $self->handler( text => \&next_text_to_date, 'self, text' ) unless $date;
        }
        elsif ( $class{'highwire-cite-metadata-pages'} ) {

            if ( not defined $first_page ) {
                $self->handler( text => \&parse_first_page, 'self, text' );
            }
            else {
                $self->handler( text => \&parse_last_page, 'self, text' );
            }
        }
        elsif ( $class{'highwire-cite-metadata-doi'} ) {

            $self->handler( text => \&retrieve_doi, 'self, text' );
        }
    }
    elsif ( $tag eq 'div' ) {

        if ( $class{'issue-toc-section'} ) {
            $self->handler( text => \&next_text_to_toc, 'self, text' );
        }
    }
}

sub next_text_to_date {
    my ( $self, $text ) = @_;

    ($date = $text) =~ s/^\s+|\s+$//g;  #/
    $self->handler( text => undef );
}

sub parse_first_page {
    my ( $self, $text ) = @_;

    return unless $text =~ /(\w+)(-\w+)?/;  #/

    $first_page = $1;
    $self->handler( text => undef );
}

sub parse_last_page {
    my ( $self, $text ) = @_;

    return unless $text =~ /\w+-(\w+)/;  #/

    $last_page = $1;
    $self->handler( text => undef );
}

sub next_text_to_toc {
    my ( $self, $text ) = @_;

    push @toc, [ $text ];
    $self->handler( text => undef );
}

sub retrieve_doi {
    my ( $self, $text ) = @_;

    return unless $text =~ /\d+/;  #/

    $text =~ s/^\s+|\s+$//g;
    push @{ $toc[-1] }, $text;
    $self->handler( text => undef );
}

Output:

<!DOCTYPE MetaIssue SYSTEM "http://schema.highwire.org/public/toc/MetaIssue.pubids.dtd">
<MetaIssue volume="309" issue="11">
  <Provider>Cadmus</Provider>
  <IssueDate>December 1, 2015</IssueDate>
  <PageRange>H1793-H1996</PageRange>
  <TOC>
    <TocSection>
      <Heading>CALL FOR PAPERS | Cardiovascular Responses to Environmental Stress</Heading>
      <DOI>10.1152/ajpheart.00199.2015</DOI>
    </TocSection>
    <TocSection>
      <Heading>CALL FOR PAPERS | Autophagy in the Cardiovascular System</Heading>
      <DOI>10.1152/ajpheart.00709.2014</DOI>
    </TocSection>
    <TocSection>
      <Heading>CALL FOR PAPERS | Mechanisms of Diastolic Dysfunction in Cardiovascular Disease</Heading>
      <DOI>10.1152/ajpheart.00608.2015</DOI>
    </TocSection>
    <TocSection>
      <Heading>CALL FOR PAPERS | Small Vessels&ndash;Big Problems: Novel Insights into Microvascular Mechanisms of Diseases</Heading>
      <DOI>10.1152/ajpheart.00463.2015</DOI>
      <DOI>10.1152/ajpheart.00691.2015</DOI>
      <DOI>10.1152/ajpheart.00568.2015</DOI>
      <DOI>10.1152/ajpheart.00653.2015</DOI>
    </TocSection>
    <TocSection>
      <Heading>CALL FOR PAPERS | Exercise Training in Cardiovascular Disease: Mechanisms and Outcomes</Heading>
      <DOI>10.1152/ajpheart.00341.2015</DOI>
    </TocSection>
    <TocSection>
      <Heading>CALL FOR PAPERS | Cardiac Regeneration and Repair: Mechanisms and Therapy</Heading>
      <DOI>10.1152/ajpheart.00594.2015</DOI>
    </TocSection>
    <TocSection>
      <Heading>Vascular Biology and Microcirculation</Heading>
      <DOI>10.1152/ajpheart.00289.2015</DOI>
      <DOI>10.1152/ajpheart.00308.2015</DOI>
      <DOI>10.1152/ajpheart.00179.2015</DOI>
    </TocSection>
    <TocSection>
      <Heading>Muscle Mechanics and Ventricular Function</Heading>
      <DOI>10.1152/ajpheart.00284.2015</DOI>
      <DOI>10.1152/ajpheart.00327.2015</DOI>
    </TocSection>
    <TocSection>
      <Heading>Signaling and Stress Response</Heading>
      <DOI>10.1152/ajpheart.00050.2015</DOI>
    </TocSection>
    <TocSection>
      <Heading>Cardiac Excitation and Contraction</Heading>
      <DOI>10.1152/ajpheart.00055.2015</DOI>
    </TocSection>
    <TocSection>
      <Heading>Integrative Cardiovascular Physiology and Pathophysiology</Heading>
      <DOI>10.1152/ajpheart.00316.2015</DOI>
      <DOI>10.1152/ajpheart.00721.2014</DOI>
    </TocSection>
    <TocSection>
      <Heading>Corrigendum</Heading>
      <DOI>10.1152/ajpheart.H-zh4-1780-corr.2015</DOI>
    </TocSection>
  </TOC>
</MetaIssue>
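On the XML::Writer suggestion: the following is only a rough sketch of how the same document might be produced with that module. It assumes XML::Writer is installed from CPAN, and the values are hard-coded samples standing in for what the parser collects.

```perl
use strict;
use warnings;
use XML::Writer;

# Sample values standing in for the data gathered by the parser
my ( $volume, $issue ) = ( 309, 11 );
my ( $date, $first_page, $last_page ) = ( 'December 1, 2015', 'H1793', 'H1996' );
my @toc = (
    [ 'Signaling and Stress Response', '10.1152/ajpheart.00050.2015' ],
    [ 'Cardiac Excitation and Contraction', '10.1152/ajpheart.00055.2015' ],
);

# OUTPUT => 'self' keeps the document in memory; it is read back
# with to_string once the document is complete
my $writer = XML::Writer->new(
    OUTPUT      => 'self',
    DATA_MODE   => 1,
    DATA_INDENT => 2,
);

$writer->doctype( 'MetaIssue', undef,
    'http://schema.highwire.org/public/toc/MetaIssue.pubids.dtd' );

$writer->startTag( 'MetaIssue', volume => $volume, issue => $issue );
$writer->dataElement( Provider  => 'Cadmus' );
$writer->dataElement( IssueDate => $date );
$writer->dataElement( PageRange => "$first_page-$last_page" );

$writer->startTag('TOC');
for my $section (@toc) {
    my ( $heading, @dois ) = @$section;

    $writer->startTag('TocSection');
    $writer->dataElement( Heading => $heading );
    $writer->dataElement( DOI => $_ ) for @dois;
    $writer->endTag('TocSection');
}
$writer->endTag('TOC');

$writer->endTag('MetaIssue');
$writer->end;

my $xml = $writer->to_string;
print $xml;
```

One difference from plain print statements is that the writer escapes characters such as `&` and `<` in element data, and it complains about unclosed or mismatched tags instead of silently writing malformed XML.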
Borodin
  • Thanks, Borodin. The above code works perfectly. I'll keep your instructions in mind – Sagar Dec 12 '15 at 14:28
  • How to get correct syntax highlighting for Perl on SO? It seems it fails `split m(/) ..` and then, for some reason the rest of the code is taken verbatim – Håkon Hægland Dec 12 '15 at 14:32
  • @HåkonHægland: You have to guess what has upset the highlighter and add comments to work around it. It's usually expecting another slash after `m//` or `s///`. I've done it in this case, but I think it's usually not worth the lack of clarity that pointless comments add to the code – Borodin Dec 12 '15 at 15:14
  • I think I will have a look at the JavaScript source that SE uses for syntax highlighting at [google/prettify](https://github.com/google/code-prettify). Maybe it is possible to fix it so we do not have to bother with these silly workarounds? – Håkon Hægland Dec 12 '15 at 15:34
  • @Borodin How can one decode `$contents` when it contains Unicode(?) characters like `&ndash`. In the XML output it is in the line `CALL FOR PAPERS | Small Vessels–Big Problems:`. I've tried `use utf8;` at the top of my script and that fails. Been looking the past few days at finding a way to replace such characters from an HTML page. `LWP::Simple` doesn't have any way to do this I think. I copied your script and got the same results as above. – Chris Charley Dec 15 '15 at 01:05
  • @ChrisCharley: That should be the topic of a new question – Borodin Dec 15 '15 at 11:59