8

I am interested in writing a Perl script that goes to the following link and extracts the number 1975: https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219

That page shows the number of white men born in 1923 who were living in San Diego County, California, in 1940. I am trying to do this in a loop structure so I can generalize over multiple counties and birth years.

In the file, locations.txt, I put the list of counties, such as San Diego County.

The current code runs, but instead of the number 1975 it displays "unknown". The number 1975 should end up in $val.

I would very much appreciate any help!

#!/usr/bin/perl

use strict;

use LWP::Simple;

open(L, "locations26.txt");

my $url = 'https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3A%22California%22%20%2Bevent_place_level_2%3A%22%LOCATION%%22%20%2Bbirth_year%3A%YEAR%-%YEAR%~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219';

out26.txt");">
open(O, ">out26.txt");
my $oldh = select(O);
$| = 1;
select($oldh);

while (my $location = <L>) {
    chomp($location);
    $location =~ s/ /+/g;
    foreach my $year (1923..1923) {
        my $u = $url;
        $u =~ s/%LOCATION%/$location/;
        $u =~ s/%YEAR%/$year/;
        #print "$u\n";
        my $content = get($u);
        my $val = 'unknown';
        if ($content =~ / of .strong.([0-9,]+)..strong. /) {
            $val = $1;
        }
        $val =~ s/,//g;
        $location =~ s/\+/ /g;
        print "'$location',$year,$val\n";
        print O "'$location',$year,$val\n";
    }
}

Update: The API is not a viable solution. I have been in contact with the site developer. The API does not apply to that part of the webpage. Hence, any solution pertaining to JSON will not be applicable.

user1690130
  • That value is dynamic content produced with JavaScript that runs after the page is loaded, so your scrape will need to be JavaScript-capable. Check out [`WWW::Mechanize::Firefox`](http://search.cpan.org/perldoc?WWW::Mechanize::Firefox) for one possible solution. – mob Feb 01 '13 at 20:32
  • You might consider using something from CPAN for this task, such as [Web::Scraper](https://metacpan.org/module/Web::Scraper) – Craig Treptow Feb 01 '13 at 20:35
  • `%YEAR%` appears twice in `$url`, so you'll want to say `$u =~ s/%YEAR%/$year/g`, and AFAICT the number you want is not wrapped in a `strong` tag. But getting the content before JavaScript is done manipulating it is still your biggest problem. – mob Feb 01 '13 at 20:49
  • Preemptively vote to reopen. This is a good question, and if it gets good answers there are many other people who would find it helpful. – mob Feb 01 '13 at 20:52

7 Answers

8

It would appear that your data is generated by JavaScript and thus LWP cannot help you. That said, it seems that the site you are interested in has a developer API: https://familysearch.org/developers/

I recommend using Mojo::URL to construct your query and either Mojo::DOM or Mojo::JSON to parse XML or JSON results respectively. Of course other modules will work too, but these tools are very nicely integrated and let you get started quickly.
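
For illustration, here is a minimal sketch (not part of the original answer) that builds the query with Mojo::URL and reads the JSON result through Mojo::UserAgent; it assumes the https://familysearch.org/search/records endpoint and the JSON Content-Type header described in the answers further down, so treat the details as provisional.

use Mojo::UserAgent;
use Mojo::URL;

my $ua = Mojo::UserAgent->new;

# Build the query URL; the endpoint and parameters follow the other answers
# in this thread and may need adjusting.
my $url = Mojo::URL->new('https://familysearch.org/search/records')->query(
    collection_id => 2000219,
    count         => 20,
    query         => '+event_place_level_1:California '
                   . '+event_place_level_2:"San Diego" '
                   . '+birth_year:1923-1923~ +gender:M +race:White',
);

# The server reportedly wants a JSON Content-Type header on the request.
my $res = $ua->get($url => {'Content-Type' => 'application/json'})->res;
print $res->json->{totalHits}, "\n";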

Joel Berger
6

You could use WWW::Mechanize::Firefox to process any site that can be loaded by Firefox.

http://metacpan.org/pod/WWW::Mechanize::Firefox::Examples

You have to install the MozRepl Firefox extension, and then you will be able to process the web page content via this module. Basically you will "remotely control" the browser.

Here is an example (it may need some adjustment):

use strict;
use warnings;
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new(
    activate => 1, # bring the tab to the foreground
);
$mech->get('https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219',':content_file' => 'main.html');

my $retries = 10;
while ($retries-- and ! $mech->is_visible( xpath => '//*[@class="form-submit"]' )) {
      print "Sleep until we find the thing\n";
      sleep 2;
};
die "Timeout" if 0 > $retries;
#fill out the search form
my @forms = $mech->forms();
#<input id="census_bp" name="birth_place" type="text" tabindex="0"/>    
#A selector prefixed with '#' must match the id attribute of the input. A selector prefixed with '.' matches the class attribute. A selector prefixed with '^' or with no prefix matches the name attribute.
$mech->field( birth_place => 'value_for_birth_place' );
# Click on the submit
$mech->click({xpath => '//*[@class="form-submit"]'});
user1126070
  • @user1690130: did you download and properly set up WWW::Mechanize::Firefox? – Mooing Duck Feb 04 '13 at 22:01
  • Does a file containing only "`use WWW::Mechanize::Firefox;`" work? If not, it's not properly set up. I'm still researching how to set up perl plugins for you. – Mooing Duck Feb 04 '13 at 22:06
  • @user1690130 http://search.cpan.org/~corion/WWW-Mechanize-Firefox-0.68/lib/WWW/Mechanize/Firefox/Installation.pod talks about how to set it up, doesn't go into details on how to make the perl end work. Still looking... – Mooing Duck Feb 04 '13 at 22:13
  • You posted in chat that you can't find the folder "`bard-mozrepl-abcdefg`". You also posted a screenshot (four times) that shows you are in the folder "`bard-mozrepl-abcdefg`", at which point I refused to continue to help. If you can't find the folder you're in, I'm not going to walk you through writing a website scraper in perl. – Mooing Duck Feb 04 '13 at 23:13
5

If you use your browser's development tools, you can clearly see the JSON request that the page you link to uses to get the data you're looking for.

This program should do what you want. I've added a bunch of comments for readability and explanation, as well as made a few other changes.

use warnings;
use strict;
use LWP::UserAgent;
use JSON;
use CGI qw/escape/;

# Create an LWP User-Agent object for sending HTTP requests.
my $ua = LWP::UserAgent->new;

# Open data files
open(L, 'locations26.txt') or die "Can't open locations: $!";
open(O, '>', 'out26.txt') or die "Can't open output file: $!";

# Enable autoflush on the output file handle
my $oldh = select(O);
$| = 1;
select($oldh);

while (my $location = <L>) {
    # This regular expression is like chomp, but removes both Windows and
    # *nix line-endings, regardless of the system the script is running on.
    $location =~ s/[\r\n]//g;
    foreach my $year (1923..1923) {
        # If you need to add quotes around the location, use "\"$location\"".
        my %args = (LOCATION => $location, YEAR => $year);

        my $url = 'https://familysearch.org/proxy?uri=https%3A%2F%2Ffamilysearch.org%2Fsearch%2Frecords%3Fcount%3D20%26query%3D%252Bevent_place_level_1%253ACalifornia%2520%252Bevent_place_level_2%253A^LOCATION^%2520%252Bbirth_year%253A^YEAR^-^YEAR^~%2520%252Bgender%253AM%2520%252Brace%253AWhite%26collection_id%3D2000219';
        # Note that values need to be doubly-escaped because of the
        # weird way their website is set up (the "/proxy" URL we're
        # requesting is subsequently loading some *other* URL which
        # is provided to "/proxy" as a URL-encoded URL).
        #
        # This regular expression replaces any ^WHATEVER^ in the URL
        # with the double-URL-encoded value of WHATEVER in %args.
        # The /e flag causes the replacement to be evaluated as Perl
        # code. This way I can look data up in a hash and do URL-encoding
        # as part of the regular expression without an extra step.
        $url =~ s/\^([A-Z]+)\^/escape(escape($args{$1}))/ge;
        #print "$url\n";

        # Create an HTTP request object for this URL.
        my $request = HTTP::Request->new(GET => $url);
        # This HTTP header is required. The server outputs garbage if
        # it's not present.
        $request->push_header('Content-Type' => 'application/json');
        # Send the request and check for an error from the server.
        my $response = $ua->request($request);
        die "Error ".$response->code if !$response->is_success;
        # The response should be JSON.
        my $obj = from_json($response->content);
        my $str = "$args{LOCATION},$args{YEAR},$obj->{totalHits}\n";
        print O $str;
        print $str;
    }
}
nandhp
  • WOW!! Where have you been all my life? :) And how can that be written to and from a .txt file like I initially had? Without any success, I've added the following before $url: open(O, ">out26.txt"); my $oldh = select(O); $| = 1; select($oldh); – user1690130 Feb 07 '13 at 03:52
  • You also need to put the filehandle in the `print` statement, like: `print O "stuff\n";` – nandhp Feb 07 '13 at 04:05
  • Ok, great, thank you! I got that working. However, I am having trouble putting in the for loop. In my $url, I replaced the ^ with % like I initially had around YEAR and LOCATION. I added foreach my $year (1923..1923) { before the $url =~ s line, with a } at the end. I'm getting: "Error 403" at die "Error ".$response->code if !$response->is_success; – user1690130 Feb 07 '13 at 04:27
  • Is the problem that this must be changed: $url =~ s/\^([A-Z]+)\^/escape(escape($args{$1}))/ge; In particular, must the args be changed? – user1690130 Feb 07 '13 at 04:30
  • Yes, that regular expression is still using `^`. But I designed that expression so you probably wouldn't have to modify it. It simply replaces `^WHATEVER^` with the `WHATEVER` element of `%args`. So if you put your for loop where my comment `# for each location, year, ...` is, then you can easily put the year into `%args`: `my %args = (LOCATION => '"San Diego"', YEAR => $year);`, which should do what you want. (If you want more parameters later, say `GENDER`, you just have to add `GENDER => 'm',` to `%args` and put `^GENDER^` at the right place in the URL.) – nandhp Feb 07 '13 at 04:33
  • But don't I need the loops to appear after open(O, ">out26.txt"); ? I want it to open that file once, and then loop over everything. – user1690130 Feb 07 '13 at 04:39
  • Yes, the loops should be after you open the output file. Why don't you open the output file earlier too? Alternatively, you could leave the loops where they are and update the hash on each iteration using something like `$args{YEAR} = $year`. – nandhp Feb 07 '13 at 04:40
  • I can't follow the comments all too well anymore. I am so grateful. But it is hard to follow when the comments are getting so long without seeing it in the code. – user1690130 Feb 07 '13 at 04:49
  • I've updated the code, not just with the loops and file output, but I also added some comments. – nandhp Feb 07 '13 at 05:00
  • Thank you for all of this! May we keep in touch? – user1690130 Feb 07 '13 at 06:29
  • It turns out that the site is incredibly fragile and the search crashes much more often than I've experienced elsewhere (or it could be my connection at the moment). Is it possible to add a few lines that make the program continue onwards instead of crashing? I've done something like this before where I dump the suddenly "bad" county name into a new file and then just continue along – user1690130 Feb 07 '13 at 14:49
  • I can make this a new post . . . – user1690130 Feb 07 '13 at 14:50
  • The code was something like: open(LOCATIONS_FAIL, ">>locations_fail2.txt"); open(L, "locations2.txt"); foreach my $v (@locations_array){ print LOCFILE $v; } if($consecutive_fail_count > 20){ print "fail file"; last LOCLOOP; } close LOCFILE; – user1690130 Feb 07 '13 at 14:56
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/24159/discussion-between-user1690130-and-nandhp) – user1690130 Feb 08 '13 at 01:48
  • How did you figure out that url? I'm trying to scrape the site in a similar way but over different parameters. . . – user1690130 Feb 11 '13 at 02:41
  • I used the Developer Tools in Google Chrome. Open the developer tools with Menu > Tools > Developer Tools > Network panel. Then reload the page. After the page is finished loading, all of the network resources requested by the page will be listed. For this site, selecting the XHR filter at the bottom of the developer tools quickly revealed the correct URL. Opera, Firefox, and Safari all have similar tools; I don't know about Internet Explorer. – nandhp Feb 11 '13 at 14:16
  • Thanks! As I wrote before, I'd like to stay in touch about perl, if that would be ok. – user1690130 Feb 11 '13 at 14:45
1

This seems to do what you need. Instead of waiting for the disappearance of the hourglass it waits - more obviously I think - for the appearance of the text node you're interested in.

use 5.010;
use warnings;

use WWW::Mechanize::Firefox;

STDOUT->autoflush;

my $url = 'https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219';

my $mech = WWW::Mechanize::Firefox->new(tab => qr/FamilySearch\.org/, create => 1, activate => 1);
$mech->autoclose_tab(0);

$mech->get('about:blank');
$mech->get($url);

my $text;
while () {
  sleep 1;
  $text = $mech->xpath('//p[@class="num-search-results"]/text()', maybe => 1);
  last if defined $text;
}

my $results = $text->{nodeValue};
say $results;
if ($results =~ /([\d,]+)\s+results/) {
  (my $n = $1) =~ tr/,//d;
  say $n;
}

output

1-20 of 1,975 results
1975

Update

This update is with special thanks to @nandhp, who inspired me to look at the underlying data server that produces the data in JSON format.

Rather than making a request via the superfluous https://familysearch.org/proxy, this code accesses the server directly at https://familysearch.org/search/records, decodes the JSON response and pulls the required data out of the resulting structure. This has the advantage of both speed (the requests are served about once a second - more than ten times faster than the equivalent request through the basic web site) and stability (as you note, the site is very flaky; in contrast, I have never seen an error using this method).

use strict;
use warnings;

use LWP::UserAgent;
use URI;
use JSON;

use autodie;

STDOUT->autoflush;

open my $fh, '<', 'locations26.txt';
my @locations = <$fh>;
chomp @locations;

open my $outfh, '>', 'out26.txt';

my $ua = LWP::UserAgent->new;

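# Index 36 of the locations file is San Diego (see the comments below);
# this slice runs San Diego plus the first three counties in the file.
# Change the loop to plain @locations to process every county.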
for my $county (@locations[36, 0..2]) {
  for my $year (1923 .. 1926) {
    my $total = familysearch_info($county, $year);
    print STDOUT "$county,$year,$total\n";
    print $outfh "$county,$year,$total\n";
  }
  print "\n";
}

sub familysearch_info {

  my ($county, $year) = @_;

  my $query = join ' ', (
    '+event_place_level_1:California',
    sprintf('+event_place_level_2:"%s"', $county),
    sprintf('+birth_year:%1$d-%1$d~', $year),
    '+gender:M',
    '+race:White',
  );

  my $url = URI->new('https://familysearch.org/search/records');
  $url->query_form(
    collection_id => 2000219,
    count => 20,
    query => $query);

  my $resp = $ua->get($url, 'Content-Type'=> 'application/json');
  my $data = decode_json($resp->decoded_content);

  return $data->{totalHits};
}

output

San Diego,1923,1975
San Diego,1924,2004
San Diego,1925,1871
San Diego,1926,1908

Alameda,1923,3577
Alameda,1924,3617
Alameda,1925,3567
Alameda,1926,3464

Alpine,1923,1
Alpine,1924,2
Alpine,1925,0
Alpine,1926,1

Amador,1923,222
Amador,1924,248
Amador,1925,134
Amador,1926,67
Borodin
  • Thank you very much for this code! I was hoping the output would look more like San Diego,1923,1975 on one line. Then I can loop over a list of counties from a .txt file (say locations.txt) and print these in an output file (say out.txt). In this structure, I'm a little unfamiliar with how to do that. For example, I usually use the print command and then add print O to send it to out.txt. And then I loop as open(L, 'locations.txt') or die "Can't open locations: $!"; (setup output .txt) while (my $location = <L>) { $location =~ s/[\r\n]//g; my url . . . Would that work here? – user1690130 Feb 07 '13 at 15:55
  • Well you did say "I am interested in writing a Perl script that ... extracts the number 1975"! Give me some examples of the input data so that I can test. Presumably years and counties? – Borodin Feb 07 '13 at 16:16
  • Not to clutter this thread up further, but upon investigating the site, it is highly fragile. I anticipate the searches not lasting more than 100 or so iterations, whereas I have thousands of data points. Is there a way around that? Like to get it to keep on going even if it crashes, and to just move the "failed" county to a separate .txt, keep on going with the next one, and just append that to the output? – user1690130 Feb 07 '13 at 16:55
  • I intend to put something of the sort into a solution. I was hoping to simply retry failed queries until they worked. Because of other commitments I may not have time until the weekend, but I shall do my best. – Borodin Feb 07 '13 at 21:18
  • The code you were asking about is fairly simple. It loads the page, and then uses the XPath expression `//p[@class="num-search-results"]/text()` to check whether the required text has appeared (the page is built using JavaScript after the basic HTML has been loaded, and this takes about ten seconds). The expression looks for the text child of a `<p>` element that has a class attribute of `num-search-results` anywhere beneath the root of the document. If the node doesn't exist yet then the `xpath` method will return `undef` and the loop will wait for a second and repeat. – Borodin Feb 10 '13 at 14:41
  • Not really. Post questions here and I will check when I can. – Borodin Feb 10 '13 at 15:54
  • Thank you. I will look into this. I've also been working on: http://stackoverflow.com/questions/14793092/crawling-in-order-to-scrape-a-site-that-is-passed-by-an-ajax-in-perl – user1690130 Feb 10 '13 at 16:01
  • Thank you so much for this code!! It runs beautifully! Is it possible to remove the my @counties = () and instead to input them from and output them to .txt files like in the initial query, in code like: open(L, 'locations2.txt') or die "Can't open locations: $!"; open(O, '>', 'out2.txt') or die "Can't open output file: $!"; # Enable autoflush on the output file handle my $oldh = select(O); $| = 1; select($oldh); while (my $location = <L>) { . . . . . print O } – user1690130 Feb 10 '13 at 16:16
  • I will look at this in a second. Per your information on your profile, I would like to discuss a more long-term project – user1690130 Feb 10 '13 at 19:49
  • I cannot get the output file to work for some reason. It is not corresponding with the input. Although it does store it in the out26.txt file correctly. Also, why the 36? Is that because of the 36 counties? That could not be done with a while loop like in the original posting? – user1690130 Feb 10 '13 at 21:50
  • I had to award the bounty because it expired. I awarded it to the other user since his code followed the input, which was essential. I would gladly accept your answer if I could figure that out instead. I like how yours fixes to make the automatic loop if it is about to fail. – user1690130 Feb 10 '13 at 21:51
  • I hoped you could figure that out! Because I don't have a copy of your input file I just created one with all the counties. The `for` loop loops over `$locations[36]` - San Diego - and then the first three. Just change it to `for my $county (@locations) { ... }` to include everything in the data file. Let's figure out a way to communicate privately. – Borodin Feb 11 '13 at 06:10
  • Send me an email to temporary address `nrelgiarnou@dunflimblag.mailexpire.com` – Borodin Feb 11 '13 at 10:49
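
A minimal sketch of the retry-and-skip idea discussed in the comments above (this is not part of Borodin's answer; the wrapper name and retry count are illustrative, and it assumes the familysearch_info sub from the update):

# Hypothetical wrapper, not from the original answer: retry a failed query
# a few times, then give up so the caller can log the county to a "failed"
# file and carry on with the next one.
sub familysearch_info_with_retry {
  my ($county, $year) = @_;
  my $max_tries = 5;    # illustrative value
  for my $try (1 .. $max_tries) {
    my $total = eval { familysearch_info($county, $year) };
    return $total if defined $total;
    warn "Attempt $try for '$county', $year failed: $@";
    sleep 2;            # brief pause before retrying
  }
  return undef;
}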
1

What about this simple script without Firefox? I investigated the site a bit to understand how it works, and I saw some JSON requests with the Firebug Firefox add-on, so I know which URL to query to get the relevant stuff. Here is the code:

use strict; use warnings;
use JSON::XS;
use LWP::UserAgent;
use HTTP::Request;

my $ua = LWP::UserAgent->new();

open my $fh, '<', 'locations2.txt' or die $!;
open my $fh2, '>>', 'out2.txt' or die $!;

# iterate over locations from locations2.txt file
while (my $place = <$fh>) {
    # remove line ending
    chomp $place;
    # iterate over years
    foreach my $year (1923..1925) {
        # building URL with the variables
        my $url = "https://familysearch.org/proxy?uri=https%3A%2F%2Ffamilysearch.org%2Fsearch%2Frecords%3Fcount%3D20%26query%3D%252Bevent_place_level_1%253ACalifornia%2520%252Bevent_place_level_2%253A%2522$place%2522%2520%252Bbirth_year%253A$year-$year~%2520%252Bgender%253AM%2520%252Brace%253AWhite%26collection_id%3D2000219";
        my $request = HTTP::Request->new(GET => $url);
        # faking referer (where we comes from)
        $request->header('Referer', 'https://familysearch.org/search/collection/results');
        # setting expected format header for response as JSON
        $request->header('content_type', 'application/json');

        my $response = $ua->request($request);

        if ($response->code == 200) {
            # this line convert a JSON to Perl HASH
            my $hash = decode_json $response->content;
            my $val = $hash->{totalHits};
            print $fh2 "year $year, place $place : $val\n";
        }
        else {
           die $response->status_line;
        }
    }
}

END{ close $fh; close $fh2; }
Gilles Quénot
0

I do not know how to post revised code from the solution above, so I am adding it here.

This code does not (yet) run correctly. However, I have made some essential updates that definitely head in that direction.

I would very much appreciate help with this updated code. I do not know how to post this code and this follow-up in a way that appeases the lords who run this site.

It gets stuck at the sleep line. Any advice on how to proceed past it would be much appreciated!

use strict;
use warnings;
use WWW::Mechanize::Firefox;

my $mech = WWW::Mechanize::Firefox->new(
    activate => 1, # bring the tab to the foreground
);
$mech->get('https://familysearch.org/search/collection/results#count=20&query=%2Bevent_place_level_1%3ACalifornia%20%2Bevent_place_level_2%3A%22San%20Diego%22%20%2Bbirth_year%3A1923-1923~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219', ':content_file' => 'main.html', synchronize => 0);

my $retries = 10;
while ($retries-- and $mech->is_visible( xpath => '//*[@id="hourglass"]' )) {
    print "Sleep until we find the thing\n";
    sleep 2;
}
die "Timeout while waiting for application" if 0 > $retries;

# Now the hourglass is not visible anymore

# fill out the search form
my @forms = $mech->forms();
#<input id="census_bp" name="birth_place" type="text" tabindex="0"/>
# A selector prefixed with '#' must match the id attribute of the input. A selector prefixed with '.' matches the class attribute. A selector prefixed with '^' or with no prefix matches the name attribute.
$mech->field( birth_place => 'value_for_birth_place' );
# Click on the submit
$mech->click({xpath => '//*[@class="form-submit"]'});
user1690130
0

You should set the current form before accessing a field:

"Given the name of a field, set its value to the value specified. This applies to the current form (as set by the "form_name()" or "form_number()" method or defaulting to the first form on the page)."

$mech->form_name( 'census-search' );
$mech->field( birth_place => 'value_for_birth_place' );

Sorry, I am not able to try this code out, and thanks for opening a new question for this.

user1126070