
I am a Perl beginner and I am passionate about web scraping using Perl. After spending a couple of hours I wrote the code below to scrape company names, addresses and telephone numbers from yell.com. The script runs, but it only collects one record (1 of 15 from page 1).

I would value your suggestions on how I can scrape all of the companies on the first page in one go, so that I can then move on to the other pages of data.

use strict;

use Data::Dumper;
use LWP::Simple; # from CPAN
use JSON qw( decode_json ); # from CPAN

use WWW::Mechanize;

my $mech = WWW::Mechanize->new();

my $header = "company_name|Address|Telephone";

open (CH, ">output.csv");

print CH "$header\n";

my $url = "http://www.yell.com/ucs/UcsSearchAction.do?keywords=Engineering+consulatants&location=United+Kingdom&scrambleSeed=13724563&searchType=&M=&bandedclarifyResults=&ssm=1";

$mech->get($url);
my $con = $mech->content();
my $res = "";

############ for company name ##########
if ( $con =~ /<a data-omniture="LIST:COMPANYNAME" href="\/biz\/ross-davy-associates-grimsby-901271213\/" itemprop="name">(.*?)<\/a>/is ) {
  $res = $1;
}
else {
  $res = "Not_Match";
}

############### for address #########
my ($add1, $add2, $add3, $add4, $add) = ("", "", "", "", "");

if ( $con =~  /<span itemprop="streetAddress">(.*?)<\/span> <span itemprop="addressLocality">(.*?)<\/span>   &#44; <span itemprop="postalCode">(.*?)<\/span> &#44; <span itemprop="addressRegion">(.*?)<\/span>/is ) {
  $add1 = $1;
  $add2 = $2;
  $add3 = $3;
  $add4 = $4;
  $add = $add1 . ' ' . $add2 . ' ' . $add3 . ' ' . $add4;
}
else {
  $add = "Not_Match";
}

########### telephone ##########
my $tel="";

if ( $con =~ /<li data-company-item="telephone" class="last">  Tel: <strong>(.*?)<\/strong> <\/li>/is ) {
  $tel = $1;
}
else {
  $tel = "Not_Match";
}

print "==$res===$add===$tel==\n";
print CH "$res|$add|$tel\n";
1 Answer
These points should help:

  • Always use warnings as well as use strict

  • Always use the three-parameter form of open, test the success of every open call, and die with a string that includes the built-in variable $! so that you know why the open failed

  • Never use regular expressions for parsing HTML. There are several modules such as HTML::TreeBuilder::XPath that do the job properly and allow simple access to the contents of the data using XPath

  • Always make sure that extracting data like this is within the terms of service of the site in question.
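The first two points above can be sketched together. This is a minimal, hedged example (it reuses the `output.csv` name from the question) of a lexical filehandle, the three-parameter form of `open`, and a `die` message that includes `$!`:

```perl
use strict;
use warnings;

my $file = 'output.csv';    # same output file name as in the question

# Three-argument open with a lexical filehandle; die with $!
# so the reason for any failure is reported
open my $out, '>', $file
    or die "Unable to open '$file' for writing: $!";

print $out "company_name|Address|Telephone\n";

close $out or die "Unable to close '$file': $!";
```

With bareword filehandles and the two-argument form, a failure such as a read-only directory would pass silently until the `print` produced nothing; here the script stops immediately with a meaningful message.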

With regard to the last point, the majority of sites prohibit any form of automated access to, or copying of, their data. Yell.com is no different. Their conditions of use say this:

You cannot use the website ... using any automated means to monitor or copy the website or its content ...

So what you are doing leaves you open to the possibility of legal action.
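To illustrate the `HTML::TreeBuilder::XPath` suggestion without touching yell.com, here is a sketch that parses an invented HTML sample (the `listing` class and the markup are made up for the example; only the `itemprop` attribute names echo the question). The key idea is to find every listing node first, then evaluate each field with an XPath expression relative to that node, which is what collects all records in one pass rather than only the first match:

```perl
use strict;
use warnings;
use HTML::TreeBuilder::XPath;    # from CPAN

# Invented sample markup -- NOT yell.com's real HTML -- using
# itemprop attributes similar to those in the question
my $html = <<'HTML';
<div class="listing">
  <a itemprop="name">Acme Engineering</a>
  <span itemprop="streetAddress">1 High St</span>
  <strong itemprop="telephone">01234 567890</strong>
</div>
<div class="listing">
  <a itemprop="name">Widget Consultants</a>
  <span itemprop="streetAddress">2 Low Rd</span>
  <strong itemprop="telephone">09876 543210</strong>
</div>
HTML

my $tree = HTML::TreeBuilder::XPath->new_from_content($html);

# Find every listing node, then query each field *relative* to
# that node (note the leading "."), so every record is captured
# instead of only the first
my @records;
for my $listing ( $tree->findnodes('//div[@class="listing"]') ) {
    my $name = $listing->findvalue('.//a[@itemprop="name"]');
    my $addr = $listing->findvalue('.//span[@itemprop="streetAddress"]');
    my $tel  = $listing->findvalue('.//strong[@itemprop="telephone"]');
    push @records, "$name|$addr|$tel";
}

$tree->delete;    # free the parse tree

print "$_\n" for @records;
```

The same node-then-relative-path pattern would let the original script iterate over all fifteen results on a page, and it survives the whitespace and attribute-order changes that break literal regex matches against HTML.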

Borodin
  • It would be fine if I pasted the source code of the page and asked you the same question. I am just asking, for learning purposes, how I could grab all the data; it works fine for the first record. – user1586957 Aug 04 '13 at 08:53
  • I will try `HTML::TreeBuilder::XPath`, but there are few tutorials about it. – user1586957 Aug 04 '13 at 08:54