1

I'm loading data from a .txt file for scraping. However, the URL requires that I take one of the values and both subtract 2 from it and add 2 to it. For example, if the value is 2342, I need to produce 2340 and 2344 for the URL.

I took a guess at how to break it up:

 $args{birth_year} = ($args{birth_year} - 2) . '-' . ($args{birth_year} + 2);

How do I then put it in the URL?

Here's the relevant part of the code:

  use strict;
  use warnings;
  use WWW::Mechanize::Firefox;
  use Data::Dumper;
  use LWP::UserAgent;
  use JSON;
  use CGI qw/escape/;
  use HTML::DOM;

  open(my $l, 'locations2.txt') or die "Can't open locations: $!";

  while (my $line = <$l>) {
      chomp $line;
      my %args;
      @args{qw/givenname surname birth_place birth_year gender race/} = split /,/, $line;
      $args{birth_year} = ($args{birth_year} - 2) . '-' . ($args{birth_year} + 2);
      my $mech = WWW::Mechanize::Firefox->new(create => 1, activate => 1);
      $mech->get("https://familysearch.org/search/collection/index#count=20&query=%2Bgivenname%3A$args{givenname}20%2Bsurname%3A$args{surname}20%2Bbirth_place%3A$args{birth_place}%20%2Bbirth_year%3A1910-1914~%20%2Bgender%3A$args{gender}20%2Brace%3A$args{race}&collection_id=2000219");

For example:

Input is:

Benjamin,Schuvlein,Germany,1912,M,White

Desired URL is:

https://familysearch.org/search/collection/index#count=20&query=%2Bgivenname%3ABenjamin%20%2Bsurname%3ASchuvlein%20%2Bbirth_place%3AGermany%20%2Bbirth_year%3A1910-1914~%20%2Bgender%3AM%20%2Brace%3AWhite&collection_id=2000219

slm
user1690130
  • I don't quite understand. Can you show input data, and expected output? – depesz Feb 11 '13 at 23:39
  • @depesz great question! Just added clarification. Please let me know if you have further questions. – user1690130 Feb 11 '13 at 23:45
  • Is there something more to this than creating the URL with sprintf, where you have ...%d-%d... and use $args{birth_year} - 2 for the first placeholder and $args{birth_year} + 2 for the second? – David M Feb 11 '13 at 23:49
  • I don't quite understand what problem you're having. You have the value in a variable. You also already substitute variables in your $mech->get() call, so what exactly is missing? – depesz Feb 11 '13 at 23:50
  • @DavidM I wrote something to that effect. Didn't I? I'm not sure how to put that into the url. – user1690130 Feb 11 '13 at 23:51
  • @depesz The problem is I don't know how to put the input into the url when it is from a string that I divide into 2+ variables, particularly when 1 variable must be divided further. – user1690130 Feb 11 '13 at 23:52
  • You have heaps of errors in your $mech->get line, for example: $args{givenname}20 is missing a percent sign before the 20. – Myforwik Feb 18 '13 at 22:37

3 Answers

3

Why can't you just change this line:

$mech->get("https://familysearch.org/search/collection/index#count=20&query=%2Bgivenname%3A$args{givenname}20%2Bsurname%3A$args{surname}20%2Bbirth_place%3A$args{birth_place}%20%2Bbirth_year%3A1910-1914~%20%2Bgender%3A$args{gender}20%2Brace%3A$args{race}&collection_id=2000219");

to this:

$mech->get("https://familysearch.org/search/collection/index#count=20&query=%2Bgivenname%3A$args{givenname}20%2Bsurname%3A$args{surname}20%2Bbirth_place%3A$args{birth_place}%20%2Bbirth_year%3A$args{birth_year}~%20%2Bgender%3A$args{gender}20%2Brace%3A$args{race}&collection_id=2000219");

NOTE: I changed this bit:

%3A1910-1914~%20

to this:

%3A$args{birth_year}~%20
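
As a minimal sketch of the same idea, the snippet below interpolates hash elements written with curly braces (as in `$args{birth_year}`) inside a double-quoted string. The sample line is the one from the question. Keeping the range in a separate `$year_range` variable and writing `%20` after every field (as in the desired URL from the question) are choices made here for illustration, not part of the answer above.

```perl
use strict;
use warnings;

# Sample record taken from the question.
my $line = 'Benjamin,Schuvlein,Germany,1912,M,White';

my %args;
@args{qw/givenname surname birth_place birth_year gender race/} = split /,/, $line;

# Keep the numeric year intact and build the range in its own variable
# ($year_range is a name chosen for this sketch).
my $year_range = ($args{birth_year} - 2) . '-' . ($args{birth_year} + 2);

# Hash elements interpolate in double quotes when written with curly braces.
my $url = "https://familysearch.org/search/collection/index#count=20&query="
        . "%2Bgivenname%3A$args{givenname}%20"
        . "%2Bsurname%3A$args{surname}%20"
        . "%2Bbirth_place%3A$args{birth_place}%20"
        . "%2Bbirth_year%3A$year_range~%20"
        . "%2Bgender%3A$args{gender}%20"
        . "%2Brace%3A$args{race}"
        . "&collection_id=2000219";

print "$url\n";
```

The resulting string can then be passed to `$mech->get($url)` in place of the long literal.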
slm
0

One way to do it:

file content:
link1
link2
...
linkn

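use strict;
use warnings;
use WWW::Mechanize::Firefox;

my $filename = 'links.txt';    # assumed name of the file that lists the links
my $mech     = WWW::Mechanize::Firefox->new();

# Read line by line; $/ keeps its default so the file is not slurped in one go.
open(my $fh, '<', $filename) or die "Can't open $filename: $!";
my $i = 1;
while (my $line = <$fh>) {
  chomp($line);
  print "line: $line\n";
  # Double quotes, so that $i is interpolated into the file name.
  my $tempfile = "./$i.html";
  $i++;
  $mech->get( $line, ':content_file' => $tempfile, synchronize => 1 );
}
close($fh);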
user1126070
0

This answer doesn't consider whether the data in the input needs to be URL-encoded; i.e., somewhere along the way, a surname such as "von Schtupp" needs to become "von%20Schtupp".

I didn't test this, so there may be a typo or minor error. Nevertheless, it's the approach that I would use. My answer also assumes that you don't care about the order in which the search criteria appear.

my %query_params = (
    givenname   => $args{givenname},
    surname     => $args{surname},
    birth_place => $args{birth_place},
    birth_year  => sprintf("%d-%d", $args{birth_year} - 2, $args{birth_year} + 2),
    gender      => $args{gender},
    race        => $args{race},
);
my $query_parameter = join '%20',
                      map { "%2B$_%3A$query_params{$_}" }
                      keys %query_params;
my $url = "https://familysearch.org/search/collection/index#count=20&query=" .
          $query_parameter . "&collection_id=2000219";
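
To cover both caveats mentioned above (URL-encoding and field order), one option is to percent-encode each value and iterate over a fixed field list instead of `keys`. The sketch below is an illustration only: `encode_value` and `build_url` are hypothetical helper names, the hand-rolled encoder assumes plain ASCII input, and the trailing `~` after the year range is copied from the desired URL in the question. In real code, the `uri_escape` function from the URI::Escape module would be the more usual choice.

```perl
use strict;
use warnings;

# Fixed field order, so the query string is reproducible.
my @fields = qw/givenname surname birth_place birth_year gender race/;

# Hypothetical helper: percent-encode everything except unreserved characters.
# Assumes ASCII input; non-ASCII text would need to be UTF-8 encoded first.
sub encode_value {
    my ($value) = @_;
    $value =~ s/([^A-Za-z0-9\-._~])/sprintf('%%%02X', ord($1))/ge;
    return $value;
}

# Hypothetical helper: build the search URL from one comma-separated record.
sub build_url {
    my ($line) = @_;
    my %args;
    @args{@fields} = split /,/, $line;

    my %query = map { $_ => encode_value($args{$_}) } @fields;

    # The range is computed once, from the untouched numeric year.
    $query{birth_year} = sprintf '%d-%d~', $args{birth_year} - 2, $args{birth_year} + 2;

    my $query_string = join '%20', map { "%2B$_%3A$query{$_}" } @fields;
    return "https://familysearch.org/search/collection/index#count=20&query="
         . $query_string . "&collection_id=2000219";
}

# Record adapted from the question, with the surname from the caveat above.
print build_url('Benjamin,von Schtupp,Germany,1912,M,White'), "\n";
```

Iterating over `@fields` rather than `keys %query` trades a little flexibility for a stable parameter order, which makes the generated URLs easier to compare against a known-good one.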
David M
  • I will look at this right away. I should say that, for the moment, I am not worried about the space, but it is something to think about. – user1690130 Feb 12 '13 at 01:53
  • Did I implement it incorrectly? I get the error: Argument "1910-1914" isn't numeric in subtraction (-) – user1690130 Feb 12 '13 at 02:13
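
Perl emits that warning when arithmetic is applied to a string such as "1910-1914". It would be consistent with the line from the question that overwrites `$args{birth_year}` with the range still being in place when the `sprintf` runs. That is an assumption about the asker's script, not something shown in the thread. A small sketch that reproduces the effect and shows one way around it (the `$year` and `$range` names are made up for illustration):

```perl
use strict;
use warnings;

my %args = (birth_year => 1912);

# First conversion, as in the question: the numeric year becomes a string.
$args{birth_year} = ($args{birth_year} - 2) . '-' . ($args{birth_year} + 2);

# A second conversion on the already-converted value triggers the warning.
my @warnings;
{
    local $SIG{__WARN__} = sub { push @warnings, @_ };
    my $broken = sprintf '%d-%d', $args{birth_year} - 2, $args{birth_year} + 2;
}
print "warning: $_" for @warnings;

# One fix: convert exactly once, from a copy of the original number.
my $year  = 1912;
my $range = sprintf '%d-%d', $year - 2, $year + 2;
print "$range\n";
```

In the script from the question, this would mean dropping one of the two conversions, so that the subtraction and addition only ever see the original four-digit year.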