How to Parse a webpage

Question

I am attempting to extract the following from the EnviroCanada weather page.

For each hour, I am trying to get a row like the following.

Time | Thigh | Tlow | Humidity

7:00 | 23 | 22.9 | 30

Extracted HTML Page:

<tr>
         <td headers="header1" class="text-center vertical-center"> 7:00 </td>
        <td headers="header2" class="media vertical-center"><span class="pull-left"><img class="media-object" height="35" width="35" src="/weathericons/small/02.png" /></span><div class="visible-xs visible-sm">
            <br />
            <br />
          </div>
          <div class="media-body">
            <p>Partly Cloudy</p>
          </div>
        </td>
        <td headers="header3m" class=" metricData text-center vertical-center">23
                                            (22.9)
                                        </td>
        <td headers="header3i" class=" imperialData hidden text-center vertical-center">73
                                            (73.2)
                                        </td>
        <td headers="header4m" class="metricData text-center vertical-center">
          <abbr title="West-Northwest">WNW</abbr> 8</td>
        <td headers="header4i" class="imperialData hidden text-center vertical-center">
          <abbr title="West-Northwest">WNW</abbr> 5</td>
        <td headers="header6" class="metricData text-center vertical-center">30</td>
        <td headers="header6" class="imperialData hidden text-center vertical-center">87</td>
        <td headers="header7" class="text-center vertical-center">83</td>
        <td headers="header8" class="metricData text-center vertical-center">20</td>
        <td headers="header8" class="imperialData hidden text-center vertical-center">68</td>
        <td headers="header9m" class="metricData text-center vertical-center">100.7</td>
        <td headers="header9i" class="imperialData hidden text-center vertical-center">29.7</td>
        <td headers="header10" class="metricData text-center vertical-center">24</td>
        <td headers="header10" class="imperialData hidden text-center vertical-center">15</td>
      </tr>

Code so far:

use strict;
use warnings;
use LWP::Simple;
use HTML::TokeParser;


 my $url = "http://weather.gc.ca/past_conditions/index_e.html?station=yyz";
 my $page = get($url) ||
die "Could not load URL\n";


 my $parser = HTML::TokeParser->new(\$page) ||
die "Parse error\n";

 $parser->get_tag("td") foreach ();
 $parser->get_tag("");
 my $time = $parser->get_text();

  ??
 my $thigh = $parser->get_text();


 ???
 my $tlow = $parser->get_text();

 ???
 my $humid = $parser->get_text();

I'm completely lost here.

[HTML::TableExtract is very useful](https://www.nu42.com/2012/04/htmltableextract-is-beautiful.html). — Sinan Ünür, Jul 07 '16 at 18:16
I like Mojo::DOM for getting things out of HTML-pages, very nice to use. — asjo, Oct 03 '16 at 19:09

zdim · Accepted Answer · 2017-08-04T07:38:30.657

4

Once you fetch the page with LWP::Simple, you can pick a specific tool depending on what needs to be done with it, instead of using a general parser.

In this case you have a table on your hands and I'd recommend HTML::TableExtract. With it you can cleanly retrieve table elements in a number of ways and then process them. It can work with multiple tables, make use of headers, set up parsing preferences, and more. Normally you don't have to even look at the actual HTML. The module is a subclass of HTML::Parser. In my experience it's been a very good tool.

Here is some basic code, for this particular page and task.

use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;

my $url = "http://weather.gc.ca/past_conditions/index_e.html?station=yyz";
# LWP::Simple's get() returns undef on failure and does not set $!
my $page = get($url) or die "Can't load $url\n";

my $headers = [ 'Time', 'Temperature', 'Humidex' ];

my $tec = HTML::TableExtract->new(headers => $headers);
$tec->parse($page);

my $fmt = "%6s | %6s | %6s | %8s\n";    
printf($fmt, 'Time', 'T-high', 'T-low', 'Humidex');    

my ($time, $temp_hi, $temp_low, $hum);

foreach my $rrow ($tec->rows) {
    # Skip rows without expected data. Clean up leading/trailing spaces.
    next if $rrow->[0] !~ /^\s*\d?\d:\d\d/;
    my @row = map { s|^\s*||; s|\s*$||; $_ } @$rrow;
    # Process as needed
    ($time, $hum) = @row[0,2];
    ($temp_hi, $temp_low) = $row[1] =~ /(\d+) .* \( (\d+\.\d+) \)/xs;
    printf($fmt, $time, $temp_hi, $temp_low, $hum);
}

The first few rows of output

  Time | T-high |  T-low |  Humidex
 16:00 |     29 |   29.2 |       37
 15:00 |     27 |   27.2 |       37
 14:00 |     26 |   25.6 |       33
...

Comments.

The headers attribute for new makes it extract columns only under those headings. The loop variable is a reference to an array holding the cells of one row. The elements are the raw text of those cells.

The first line inside the loop skips rows that don't have the expected format: an optional digit \d? followed by another digit, then : then two digits. This matches a time written either as 3:00 or as 03:00.
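As a minimal sketch of that row filter in isolation, the snippet below applies the same pattern to a handful of made-up first-column cells (the sample strings are assumptions, not data from the page) and keeps only those that start with a time.

```perl
use strict;
use warnings;

# Hypothetical first-column cells; only some of them hold a time.
my @cells = (' 7:00 ', '07:00', '16:00', 'Time', '4 August 2017', '');

# Same test as in the loop above: optional leading spaces, an optional
# digit, one digit, a colon, and two digits.
my @kept = grep { /^\s*\d?\d:\d\d/ } @cells;

print "kept: '$_'\n" for @kept;
```

Header cells, date rows, and empty cells fail the test, which is why the loop can safely skip them with next.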

The arrayref $rrow is copied into an array @row for clarity, with leading and trailing whitespace stripped from each cell. The elements in columns @row[0,2] are used as they come. The one in $row[1] is parsed by a regex, which captures an integer (\d+) and then a decimal number inside parentheses (\d+\.\d+), with possible intervening text (.*). In list context the match returns these captures, which are assigned to the other two variables.
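The capture step can be tried on its own with a sketch like the one below. The helper name split_temps and the sample strings are hypothetical. The optional minus sign (-?) is an addition that is not in the regex above; it is there on the assumption that sub-zero readings appear as, for example, -5 (-4.8).

```perl
use strict;
use warnings;

# Hypothetical helper: returns (whole_number, decimal_in_parentheses),
# or an empty list when the cell does not look like "23 (22.9)".
sub split_temps {
    my ($cell) = @_;
    # /x allows spaces in the pattern for readability;
    # /s lets . match the line break inside the cell.
    # The -? is an assumed extension for negative temperatures.
    return $cell =~ /(-?\d+) .* \( (-?\d+\.\d+) \)/xs;
}

# Made-up cell texts mimicking the whitespace seen in the question's HTML.
for my $cell ("23\n      (22.9)", "-5\n      (-4.8)", "n/a") {
    my ($hi, $low) = split_temps($cell);
    if (defined $hi) {
        print "high=$hi low=$low\n";
    }
    else {
        print "no match for '$cell'\n";
    }
}
```

Checking that the first capture is defined before printing avoids "uninitialized value" warnings on rows where the cell has some other format.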

See the module's documentation and, if needed, tutorials on references perlreftut and on regular expressions perlretut. Another useful page is the Data Structures Cookbook perldsc. For other introductions see Tutorials. They typically have links to more specific docs.

edited Aug 04 '17 at 07:38

answered Jul 07 '16 at 18:10

zdim


The question is how do I extract for 3:00, 5:00 etc. and place the values in the appropriate column? They all fall under the same headers tags. – BrianB Jul 07 '16 at 19:06
I tried HTML::TableExtract (simple test) and it did not like 'my foreach': `use LWP::Simple; use HTML::TableExtract; use Text::Table; my $doc = 'http://weather.gc.ca/past_conditions/index_e.html?station=yyz'; my $headers = [ 'Time', 'Temperature' ]; my $table_extract = HTML::TableExtract->new(headers => $headers); my $table_output = Text::Table->new(@$headers); $table_extract->parse_file($doc); my ($table) = $table_extract->tables; foreach my $row ($table->rows) { clean_up_spaces($row); # not shown for brevity $table_output->load($row); } print $table_output;` – BrianB Jul 07 '16 at 19:32
@BrianB I've posted some basic but working code. Will clean it up when I get time (even though it looks like you don't need that!). Let me know how it goes. – zdim Jul 07 '16 at 19:45
Thanks appreciate the help – BrianB Jul 07 '16 at 19:50
Sorry for this delayed response - I have a problem with the results: it extracts up to 10:00 and skips 9:00 - 1:00, and I can't figure out why. – BrianB Sep 01 '16 at 01:17
The issue is that it only extracts times with 2 hour digits, i.e. HH:MM, not H:MM. Is there a way to fix that? – BrianB Sep 01 '16 at 01:45
@BrianB That turned out a long comment -- what I meant to say is: it's fixed. In the line `next if ...` I changed `/\d\d:\d\d/` to `/\d?\d:\d\d/`. So, added one character, the `?`. It means that the pattern before it is optional, there may be one or zero of it. Please let me know if adding some explanation(s) would be helpful. – zdim Sep 01 '16 at 02:49
Hi zdim - thanks, I will run it tomorrow - but I'm more than confident that your fix will work – BrianB Sep 01 '16 at 04:05
I just ran it now - IT WORKS THANKS – BrianB Sep 01 '16 at 04:06
@BrianB Good, thanks for responding -- yes, this had to fix ... this. It's tricky to come up with a general rule for any and all HTML format glitches. Just let me know if something else comes up. – zdim Sep 01 '16 at 04:07
Hi zdim, can you provide some direction on how to do a horizontal column scrape? http://www.accuweather.com/en/ca/thunder-bay/p7b/hourly-weather-forecast/49563 I was trying to get TIME, TEMP, REAL, HUMIDITY – BrianB Oct 01 '16 at 15:59
@BrianB See the line `foreach my $row ...`. The `$row` is an _arrayref_, so `@$row` is the array. You can look at the first element, `$row->[0]`, and when it contains `'Temp'` the rest of the array holds the values for Temp. If it has `'Real'` then the remaining elements are values for that. Etc. You do this by, say, `$row->[0] =~ /^\s*Temp/` (find `Temp` at the start of the cell, except for possible spaces). If that's what you mean ...? You also need to first _see_ the table, as there could be empty cells etc. (print out all rows and look at them). The 'Humidity' seems to involve another table. – zdim Oct 02 '16 at 04:39
@BrianB If arrayrefs are giving you trouble, extract it to an array and work with that -- `my @arr = @$refarr`. Let me know if the previous comment doesn't make sense. – zdim Oct 02 '16 at 04:41
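The approach described in the two comments above could be sketched roughly as follows. The rows, labels, and values here are made up for illustration and are not taken from the linked page; real rows would come from the module's rows method and would need to be inspected first.

```perl
use strict;
use warnings;

# Hypothetical rows, each an arrayref whose first cell is a label.
my @rows = (
    [ '  Time ',    '1pm', '2pm', '3pm' ],
    [ ' Temp (C) ', '18',  '19',  '19'  ],
    [ ' RealFeel',  '17',  '18',  '18'  ],
    [ undef,        '',    '',    ''    ],    # a row with an empty label
);

my %series;
for my $rowref (@rows) {
    # Copy the arrayref's contents: first cell is the label, the rest are values.
    my ($label, @values) = @$rowref;
    next unless defined $label;
    if    ($label =~ /^\s*Time/) { $series{time} = \@values }
    elsif ($label =~ /^\s*Temp/) { $series{temp} = \@values }
    elsif ($label =~ /^\s*Real/) { $series{real} = \@values }
}

# Turn the per-label rows into one output line per hour.
for my $i (0 .. $#{ $series{time} }) {
    printf "%5s | %4s | %4s\n",
        $series{time}[$i], $series{temp}[$i], $series{real}[$i];
}
```

Collecting each labelled row into a hash first keeps the label matching separate from the output step, so a further label only needs one more elsif branch.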
Thanks for the quick response zdim, I just walked back in - I will go through this in the morning. Thanks again for your help. – BrianB Oct 02 '16 at 05:02
@BrianB Welcome -- let me know how it goes. – zdim Oct 02 '16 at 05:13
Hi zdim thanks for your help. I am going to do some reading as I need to fully understand this utility better. Thanks – BrianB Oct 02 '16 at 17:01
@BrianB OK. Let me know if things pop up, and another question is always possible. The basic thing you get from the module is that it parses the table (gets all the details out), and returns the rows in arrayrefs. That's the `foreach my $rowref (...)`. So you can then go through them and parse specifics in each row with a regex. That's `@row = @$rowref` and process `@row`. (The `$row` and `@row` are completely different in Perl. My naming of `$row` may have been confusing -- `$rrow` or `$rowref` are clearer.) And then the tool offers a few other goodies. – zdim Oct 02 '16 at 19:01
First off, I appreciate your help here. This is obviously more advanced than I had expected when I decided to tackle it. Thanks for the tutorial links. I was off trying to get some background, but your links look to be better than what I was coming across. You made reference to the 'other question'. Are you referring to that follow-up question within this stream or a separate question altogether? BTW I don't mean to waste your time with this, so again I appreciate your help. Brian – BrianB Oct 03 '16 at 13:58
@BrianB You are welcome. Thanks for concern about my time! It's OK, if I didn't have time I'd let you know :) I realized that it's probably things like references (and regex) that's in the way. They are everywhere, if you use Perl in any way it's better to just go over it. Basics aren't hard -- a reference is a pointer to some data structure (if C is OK). Basic regex is simple-minded, then it gets hairy. The "other question" I meant is [here](http://stackoverflow.com/questions/39261542/parsing-webpage-how-to-extract-times-with-single-hour-digits). I dropped a message there. – zdim Oct 03 '16 at 17:31
