
I'm trying to programmatically scrape the files from this page: https://olms.dol-esa.gov/query/getYearlyData.do (yes, it probably would be faster to download them manually, but I want to learn how to do this).

I have the following bit of code to attempt this on one of the files as a test:

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize; 

my $mech = WWW::Mechanize->new;

$mech->get( 'https://olms.dol-esa.gov/query/getYearlyData.do' );
print $mech->uri();
$mech->submit_form( with_fields => { selectedFileName => '/filer/local/cas/YearlyDataDump/2000.zip' } );

When I run the code, nothing happens. Nothing gets downloaded. Thinking JavaScript might be the problem, I also tried the same code with WWW::Mechanize::Firefox. Again, nothing happens when I run the code.

I also don't see the paths to the files. They're probably obscured in some JavaScript.

So what's the best way to get these files? Is it possible to get them without javascript?

StevieD
  • Disable JavaScript in your browser and you'll see that the page is useless without it, so WWW::Mechanize is out. Instead of using WWW::Mechanize::Firefox, though, you should see if the data is available via an API; that's almost always a better choice than scraping. [Here](http://developer.dol.gov/) is the main page for the Department of Labor API. – ThisSuitIsBlackNot Feb 02 '17 at 23:12
  • Yes, I'm aware the page shows up blank without JavaScript turned on. However, the source code is still there, and I'm curious to know why a POST request, with the appropriate fields sent with the request, doesn't cause the server to send the document, especially with WWW::Mechanize::Firefox. – StevieD Feb 02 '17 at 23:47
  • Look at the source with JS disabled: there's no `<form>` element, so naturally `submit_form` won't do anything. As for W::M::F not working, you didn't set a `submitButton` parameter, and there are other headers that you haven't set that might be required. But again, this is exactly what APIs are for; you shouldn't be spelunking around in the guts of a web page made for human consumption because it could change at any time and break your code in a million different ways. – ThisSuitIsBlackNot Feb 03 '17 at 00:12

1 Answer


While the comments by ThisSuitIsBlackNot are spot on, there is a rather simple way of doing this programmatically without using JS at all. You don't even need WWW::Mechanize.

I've used Web::Scraper to find all the files. As you said, the form values are there; it's just a matter of scraping them out. WWW::Mechanize is good at navigating, but not very good at scraping. Web::Scraper's interface, on the other hand, is really easy.

Once we have the files, all we need to do is submit a POST request with the correct form values. This is pretty similar to WWW::Mechanize's submit_form. In fact, WWW::Mechanize is an LWP::UserAgent under the hood, and all we need is a request, so we can use it directly.
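
That inheritance is easy to verify if you're curious (a tiny illustration, not needed for the script below):

use WWW::Mechanize;

# WWW::Mechanize inherits from LWP::UserAgent, so this prints 1
print WWW::Mechanize->isa('LWP::UserAgent'), "\n";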

The :content_file option on the post method tells it to put the response into a file. It will do the right thing with the ZIP file and write it as binary automatically.

use strict;
use warnings;
use LWP::UserAgent;
use Web::Scraper;
use URI;

# build a Web::Scraper to find all files on the page
my $files = scraper {
    process 'form[name="yearlyDataForm"]',    'action'  => '@action';
    process 'input[name="selectedFileName"]', 'files[]' => '@value';
};

# get the files and the form action
my $res = $files->scrape( URI->new('https://olms.dol-esa.gov/query/getYearlyData.do') );

# use LWP to download them one by one
my $ua = LWP::UserAgent->new;
foreach my $path ( @{ $res->{files} } ) {

    # the file will end up relative to the current working directory (.)
    my ($filename) = ( split '/', $path )[-1];

    # the submit is hardcoded, but that could be dynamic as well
    $ua->post(
        $res->{action},
        { selectedFileName => $path, submitButton => 'Download' },
        ':content_file' => $filename # this downloads the file
    );
}

Once you run this, you'll have all the files in the current working directory. It will take a moment and there is no output, but it works.
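
If you'd like some feedback while it runs, a small variation of the download loop (a sketch, untested against the live site, reusing the same $ua and $res from above) could print each filename and check the HTTP status of the response:

foreach my $path ( @{ $res->{files} } ) {
    my ($filename) = ( split '/', $path )[-1];
    print "Downloading $filename ... ";

    # post the form; :content_file writes the body to $filename
    my $response = $ua->post(
        $res->{action},
        { selectedFileName => $path, submitButton => 'Download' },
        ':content_file' => $filename,
    );

    # is_success and status_line come from HTTP::Response
    print $response->is_success
        ? "done\n"
        : 'failed: ' . $response->status_line . "\n";
}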

You need to make sure to include the submit button field in the POST data; that's what the submitButton => 'Download' pair above is for.

Since you wanted to learn how to do something like this, I've built it to be slightly dynamic. The form action gets scraped as well, so you could reuse this on similar forms that use the same field names (or make those an argument) and you wouldn't have to care about the form action. The same could be done with the submit button, but you'd need to grab both its name and value attributes; see the sketch below.
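
Here's what that could look like as a sketch (the input[type="submit"] selector is an assumption about the page's markup; if the button were a button element instead, the selector would need adjusting):

# extend the scraper to also grab the submit button's name and value
my $files = scraper {
    process 'form[name="yearlyDataForm"]',    'action'       => '@action';
    process 'input[name="selectedFileName"]', 'files[]'      => '@value';
    process 'input[type="submit"]',           'submit_name'  => '@name',
                                              'submit_value' => '@value';
};

# ... then, inside the download loop, build the field dynamically
$ua->post(
    $res->{action},
    {
        selectedFileName    => $path,
        $res->{submit_name} => $res->{submit_value},
    },
    ':content_file' => $filename,
);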

I'll repeat what ThisSuitIsBlackNot said in their comment though: scraping a website always comes with the risk that it changes later! For a one-time thing that doesn't matter, but if you wanted to run this as a cron job once a year, it might already fail next year because they finally updated their website to be a bit more modern.

simbabque
  • Very ingenious. I am familiar with Web::Scraper but did not think to use it to pull the necessary form elements like that. I also did not think about using LWP::UserAgent like that either. I learned something, which is what I was looking for. Thanks! Yes, I'm well aware of the limitations of scraping. This is more of an academic exercise so I can polish my skills. – StevieD Feb 03 '17 at 13:35
  • @StevieD glad I could help. :) – simbabque Feb 03 '17 at 13:43