Scraping multiple items off of a Page into a Neat Row

Question

As an example:

I load in the input from a .txt:

Benjamin,Schuvlein,Germany,1912,M,White

I run some code, which I will not post here for brevity, and arrive at the link:

https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ

I want to scrape multiple items from that page; the code below scrapes only one.
I'd also like each item to be separated by a comma in the output .txt,
and I'd like the output to be preceded by the input.

I'm using the following packages in the code:

use strict;
use warnings;
use WWW::Mechanize::Firefox;
use Data::Dumper;
use LWP::UserAgent;
use JSON;
use CGI qw/escape/;
use HTML::DOM;

Here's the relevant code:

my $ua = LWP::UserAgent->new;
open(my $o, '>', 'out2.txt') or die "Can't open output file: $!";
# Here is the url, although in practice, it is scraped itself using different code
my $url = 'https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ';
print "My URL is <$url>\n";
my $request = HTTP::Request->new(GET => $url);
$request->push_header('Content-Type' => 'application/json');
my $response = $ua->request($request);
die "Error " . $response->code if !$response->is_success;
my $dom_tree = HTML::DOM->new;
$dom_tree->write($response->content);
$dom_tree->close;
my $str = $dom_tree->getElementsByTagName('table')->[0]->getElementsByTagName("td")->[10]->as_text();
print $str;
print $o $str;

Desired Output (from that link) is something like:

Benjamin,Schuvlein,Germany,1912,M,White,Queens,New York,Married,Same Place,Head, etc ....

(How much of that output section is scrapable?)

Any help on how to get the link within the link would be much appreciated!

That website grabs this data using AJAX. Look at the requests. The data is returned in the JSON format. I'd also check to see if they have an API before scraping. — Blender, Feb 08 '13 at 20:29
Go on Craigslist and find a freelance programmer to hire, or show where in the code you're having problems to get advice. — Bill, Feb 08 '13 at 20:39
@Blender Thank you for your wonderful question. I updated the post to answer your question about the API. — user1690130, Feb 08 '13 at 21:11
@Blender: What makes you say the site uses AJAX? This URL is fetched with a simple HTTP `GET` and the response is `OK` with `application/html` content. — Borodin, Feb 13 '13 at 16:28

score 2 · Answer 1 · edited Apr 26 '14 at 07:01

Try this:

use strict;
use warnings;
use LWP::UserAgent;
use HTML::TableExtract;

# Disables SSL hostname verification; avoid this outside of quick tests.
$ENV{'PERL_LWP_SSL_VERIFY_HOSTNAME'} = 0;
my $ua = LWP::UserAgent->new;
$ua->agent("Mozilla/5.0 (Windows NT 5.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.91 Safari/537.11");
my $req = HTTP::Request->new(GET => "https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ");
my $res = $ua->request($req);
die "Couldn't get it! " . $res->status_line unless $res->is_success;
my $content = $res->content;

# Parse only the table whose class attribute is "result-data"
my $te = HTML::TableExtract->new( attribs => { 'class' => 'result-data' } );
$te->parse($content);
my $table = $te->first_table_found;
# $te->tables_dump(1);   # uncomment to dump every cell while debugging

# Row 4 holds the "event place:" label (column 0) and its value (column 1)
print $table->cell(4,0) . ' = ' . $table->cell(4,1), "\n";

Which prints:

event place: = Assembly District 2, Queens, New York City, Queens, New York, United States

I also noticed this header:

X-Copyright: COPYRIGHT WARNING Data accessible through the FamilySearch API is protected by copyright. Any programmatic access, reformatting, or rerouting of this data, without permission, is prohibited. FamilySearch considers such unauthorized use a violation of its reproduction, derivation, and distribution rights. Contact devnet (at) familysearch.org for further information.

See also http://metacpan.org/pod/HTML::Element#SYNOPSIS

Thank you! And how would other cells be added to that? Like the Married one, for example? — user1690130, Feb 12 '13 at 02:41
I meant: How could I get the output: Queens, New York, Married; such that they are one after the other but on the same line of the .txt. — user1690130, Feb 12 '13 at 02:47
Assign the values to variables, then `print $o "Benjamin,Schuvlein,Germany,1912,M,White,$eventPlace,$married,...\n";` — Chloe, Feb 12 '13 at 02:56
I hope you don't mind me asking how to assign a variable. . . — user1690130, Feb 12 '13 at 03:24
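
The suggestion in the comments above (assign the scraped values to variables, then print them after the input) could be sketched roughly as follows. This is a minimal sketch, not the answerer's code: the scraped values are hard-coded stand-ins for what calls such as `$table->cell(4, 1)` would return, and the helper name `build_row` is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: join the original input line and the scraped
# values into a single comma-separated row.
sub build_row {
    my ($input_line, @scraped) = @_;
    chomp $input_line;                      # drop the trailing newline from the .txt line
    return join ',', $input_line, @scraped;
}

# Stand-in values; in the answer's script these would be assigned from
# the table, e.g. my $event_place = $table->cell(4, 1);
my $input       = "Benjamin,Schuvlein,Germany,1912,M,White\n";
my $event_place = 'Queens';
my $marital     = 'Married';

my $row = build_row($input, $event_place, $marital);
print "$row\n";
```

To write the row to the output file instead of the terminal, print to the filehandle opened in the question: `print $o "$row\n";`.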

score 2 · Answer 2 · answered Feb 13 '13 at 16:24

This is fairly simple to do using HTML::TreeBuilder::XPath to access the HTML. This program builds a hash of the data using the labels as keys, so any of the desired information can be extracted. I have enclosed in quotes any fields that contain commas or whitespace.

I don't know whether you have the permission of this web site to extract data this way, but I should draw your attention to this X-Copyright header in the HTTP responses. This approach clearly falls under the heading of programmatic access.

X-Copyright: COPYRIGHT WARNING Data accessible through the FamilySearch API is protected by copyright. Any programmatic access, reformatting, or rerouting of this data, without permission, is prohibited. FamilySearch considers such unauthorized use a violation of its reproduction, derivation, and distribution rights. Contact devnet (at) familysearch.org for further information.

Am I to expect an email from you? I replied to your first mail but haven't heard since.

use strict;
use warnings;

use URI;
use LWP;
use HTML::TreeBuilder::XPath;

my $url = URI->new('https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ');

my $ua = LWP::UserAgent->new;
my $resp = $ua->get($url);
die $resp->status_line unless $resp->is_success;

my $tree = HTML::TreeBuilder::XPath->new_from_content($resp->decoded_content);
my @results = $tree->findnodes('//table[@class="result-data"]//tr[@class="result-item"]');
my %data;
for my $item (@results) {
  my ($key, $val) = map $_->as_trimmed_text, $item->content_list;
  $key =~ s/:$//;
  $data{$key} = $val;
}

my $record = join ',', map { local $_ = $data{$_}; /[,\s]/ ? qq<"$_"> : $_ }
  'name', 'birthplace', 'estimated birth year', 'gender', 'race (standardized)',
  'event place', 'marital status', 'residence in 1935',
  'relationship to head of household (standardized)';

print $record, "\n";

Output:

"Benjamin Schuvlein",Germany,1912,Male,White,"Assembly District 2, Queens, New York City, Queens, New York, United States",Married,"Same Place",Head
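
The quoting in the program above wraps fields that contain commas or whitespace, but it does not handle a value that itself contains a double quote, and a label missing from the page would yield an undefined value. A minimal sketch of a more defensive version is below; `csv_field` and `csv_row` are hypothetical helper names, and the `%data` contents are stand-ins for the scraped hash. For real CSV output, a CPAN module such as Text::CSV is probably the more thorough choice.

```perl
use strict;
use warnings;

# Hypothetical helper: quote a field when it contains a comma, a double
# quote, or whitespace, doubling any embedded double quotes (the usual
# CSV convention). Undefined values become empty fields.
sub csv_field {
    my ($value) = @_;
    $value = '' unless defined $value;
    return $value unless $value =~ /[",\s]/;
    (my $escaped = $value) =~ s/"/""/g;
    return qq{"$escaped"};
}

# Hypothetical helper: build one CSV row from a list of values.
sub csv_row {
    return join ',', map { csv_field($_) } @_;
}

# Stand-in for the %data hash built from the page.
my %data = (
    'name'           => 'Benjamin Schuvlein',
    'birthplace'     => 'Germany',
    'marital status' => 'Married',
);

# A hash slice returns undef for a label that was not found on the page.
print csv_row(@data{'name', 'birthplace', 'marital status', 'no such label'}), "\n";
```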

user1126070 · Answer 3 · 2013-02-11T15:40:48.027

I thought I had answered your question.

The problem is that you are trying to fetch the webpage with LWP. Why are you trying to do that when you already have WWW::Mechanize::Firefox?

Did you try this?

It will retrieve and save each link for further analysis. With a small change, you get the DOM tree. Sorry, I do not have access to this page, so I just hope it will work.

my $i = 1;
for my $link (@links) {
  print Dumper $link->url;
  print Dumper $link->text;
  my $tempfile = "./$i.html";  # double quotes so that $i is interpolated
  $i++;
  $mech->get( $link->url, ':content_file' => $tempfile, synchronize => 1 );
  my $dom_tree = $mech->document();
  my $str = $dom_tree->getElementsByTagName('table')->[0]->getElementsByTagName("td")->[9]->as_text();
}

EDIT: Process the page content with a regexp. (Everyone: please remember, there is always more than one way to do something with Perl! It works, it is easy...)

I tried it out with this command:

wget -nd 'https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ' -O 1.html && perl 1.pl 1.html

use Data::Dumper;
use strict;
use warnings;

local $/=undef;
my $html = <>;#read from file
#$html = $mech->content( format => 'html' );# read data from mech object
my $data = {};
my $current_label = "not_defined";
while ($html =~ s!(<td[^>]*>.*?</td>)!!is){ # process each TD
    my $td = $1;
    print "td: $td\n";
    my $td_val = $td;
    $td_val =~ s!<[^>]*>!!gis;
    $td_val =~ s!\s+! !gs;
    $td_val =~ s!(\A\s+|\s+\z)!!gs;
    if      ($td =~ m!result-label!){ #primitive state machine, store the current label
        print "current_label: $current_label\n";
        $current_label = $td_val;
    } elsif ($td =~ m!result-value!){ #add each data to current label
        push(@{$data->{$current_label}},$td_val);

    } else {
        warn "found something else: $td\n";
    }
}
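# Process the result using a whitelist of known labels (son, race, etc.):
# delete each handled label from the result, and die if anything unknown remains.
# Note: process_multi() and process_simple() are not defined in this answer; supply your own.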
#multi type
foreach my $type (qw(son wife daughter head)){
    process_multi($type,$data->{$type});
    delete($data->{$type});
}
#simple type
foreach my $type (qw(birthplace age)){
    process_simple($type,$data->{$type});
    delete($data->{$type});
}

die "Unknown label!".Dumper($data) if scalar(keys %{$data})>0;

Output:

      'line number:' => [
                          '28'
                        ],
      'estimated birth year:' => [
                                   '1912'
                                 ],
      'head' => [
                  'Benjamin Schuvlein',
                  'M',
                  '28',
                  'Germany'
                ],
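
The script above calls `process_multi` and `process_simple` without defining them. One possible shape for them is sketched below, purely as an illustration: both subs and their output format are hypothetical, and the sample `$data` mirrors the structure shown in the output above (each label maps to an array reference of values).

```perl
use strict;
use warnings;

# Hypothetical: flatten a multi-valued label (e.g. 'head') into one string.
sub process_multi {
    my ($type, $values) = @_;
    return '' unless $values && @$values;   # label not present on the page
    return "$type=" . join('|', @$values);
}

# Hypothetical: take the first value of a single-valued label.
sub process_simple {
    my ($type, $values) = @_;
    return '' unless $values && @$values;   # label not present on the page
    return "$type=" . $values->[0];
}

# Sample data in the same shape as the hash the script builds.
my $data = {
    'head'       => [ 'Benjamin Schuvlein', 'M', '28', 'Germany' ],
    'birthplace' => [ 'Germany' ],
};

print process_multi('head', $data->{'head'}), "\n";
print process_simple('birthplace', $data->{'birthplace'}), "\n";
```

Returning an empty string for a missing label keeps the whitelist loops from dying on pages that lack, say, a 'son' or 'daughter' entry.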

I am getting an error along the lines of: MozRepl::RemoteObject: NS_ERROR_FAILURE: Component returned failure code. Does that mean there is a problem with my MozRepl installation? — user1690130, Feb 11 '13 at 15:39
I'm getting caught now in perl thinking but not doing anything. I had to kill the cmd. — user1690130, Feb 11 '13 at 15:50
Did you start Firefox before running the Perl script? Could you please paste the full error here? — user1126070, Feb 11 '13 at 15:54
The problem is the line: my $html = <>;#read from file Perl freezes when it gets there. There is no error. Just freezes. So I kill the cmd. — user1690130, Feb 11 '13 at 15:54
"my $html = <>;#read from file" is what I used; see how I executed the test script on my machine. You could merge the two scripts, and then you could use '$html = $mech->content( format => 'html' );' instead. — user1126070, Feb 11 '13 at 16:56
I'm very confused now. Which is the executable code that got you that output? — user1690130, Feb 11 '13 at 17:01
$html = $mech->content( format => 'html' ) in place of my $url; ? Is that another module to install ? — user1690130, Feb 11 '13 at 17:23
let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/24330/discussion-between-user1690130-and-user1126070) — user1690130, Feb 11 '13 at 17:32
I don't think this code is doing what I asked. I want to scrape information from https://familysearch.org/pal:/MM9.1.1/K3BN-LLJ whose address is found at the longer site. I think the information you are scraping in this code comes from the longer site address rather than the shorter one. — user1690130, Feb 11 '13 at 19:20
