I have a problem parsing data from Freebase with a crawler I'm writing in Perl. I'm trying to pull data from this URL:

(example)

http://www.freebase.com/authority/imdb/title?ns&lang=en&filter=%2Ftype%2Fnamespace%2Fkeys&timestamp=2013-11-20&timestamp=2013-11-21

It is a page of IMDB_IDs and MIDs, and I'm trying to extract the links. The problem is that I only get 100 results; when I scroll to the bottom of the page in Mozilla Firefox, 11 more results are loaded. I'm using LWP::UserAgent.

Does anybody know a solution, with some sample code, for automatically pulling out all 111 MID links from this page?

Here is my code:

    #!/usr/bin/perl
    use strict;
    use warnings;

    use LWP::UserAgent;
    use HTML::LinkExtor;

    # The original URL contained stray whitespace before "filter", which
    # broke the request; it has been removed here.
    my $URL = 'http://www.freebase.com/authority/imdb/title?ns&lang=en&filter=%2Ftype%2Fnamespace%2Fkeys&timestamp=2013-11-20&timestamp=2013-11-21';

    my $browser = LWP::UserAgent->new();
    $browser->timeout(10);

    my $response = $browser->get($URL);
    die $response->status_line, "\n" if $response->is_error;

    my $contents = $response->decoded_content;
    my $page_parser = HTML::LinkExtor->new(undef, $URL);
    $page_parser->parse($contents);
    $page_parser->eof;

    # Each link is [tag, attr1 => url1, attr2 => url2, ...];
    # keep only hrefs that point at a Freebase MID.
    foreach my $link ($page_parser->links) {
        my ($tag, %attrs) = @$link;
        next unless defined $attrs{href};
        my $mid = $attrs{href};
        print "MID $mid\n" if $mid =~ m{^http://www\.freebase\.com/m/};
    }
Gabs00
    LWP::UserAgent doesn't handle JS-heavy pages very well. You might have an easier time using Freebase's [HTTP API](https://developers.google.com/freebase/v1/getting-started). – rutter Dec 11 '13 at 00:57
  • I'd like to do it with some perl module. Anybody can give me some code sample. – user3085049 Dec 11 '13 at 01:02
  • You could use any HTTP client library, more than likely. The key difference is whether you're handling data that's [formatted for people to use](http://en.wikipedia.org/wiki/Special:Log) or [more for robots to use](http://en.wikipedia.org/w/api.php?action=query&list=logevents&format=xml). – rutter Dec 11 '13 at 01:12
  • I'd like to extract all MID from this URL. http://www.freebase.com/authority/imdb/title?ns&lang=en&filter=%2Ftype%2Fnamespace%2Fkeys&timestamp=2013-11-20&timestamp=2013-11-21 – user3085049 Dec 11 '13 at 01:16
  • And if possible the data should be formatted for people to use. Some code example would be good. – user3085049 Dec 11 '13 at 01:22

1 Answer

Crawling freebase.com will likely get you blocked. As was mentioned in the comments, Freebase offers both a RESTful JSON API for light- and medium-duty use and interactive queries, and a bulk download of the entire database for heavy consumers.
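For example, the MQL read service supports cursor-based paging, which sidesteps the infinite-scroll problem entirely. The sketch below is an assumption-laden illustration, not tested against the live service: the endpoint URL, the query shape, and the cursor semantics are taken from the Freebase v1 API documentation, and an API key may be required for real use.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use JSON;   # CPAN module, for encoding the query and decoding responses
use URI;

# Sketch only: endpoint and query shape assumed from the Freebase v1 docs.
my $ua = LWP::UserAgent->new(timeout => 10);

# MQL query: every topic with a key in the /authority/imdb/title
# namespace, returning its MID and the key value (the IMDB id).
my $query = [{
    mid => undef,
    key => [{ namespace => '/authority/imdb/title', value => undef }],
}];

my $cursor = '';   # an empty cursor starts paging from the beginning
while (defined $cursor) {
    my $uri = URI->new('https://www.googleapis.com/freebase/v1/mqlread');
    $uri->query_form(query => encode_json($query), cursor => $cursor);

    my $response = $ua->get($uri);
    die $response->status_line, "\n" unless $response->is_success;

    my $data = decode_json($response->decoded_content);
    for my $topic (@{ $data->{result} }) {
        printf "MID %s  IMDB %s\n", $topic->{mid}, $topic->{key}[0]{value};
    }

    # mqlread returns a fresh cursor with each page, and JSON false
    # once the results are exhausted.
    $cursor = $data->{cursor} ? $data->{cursor} : undef;
}
```

Unlike scraping the HTML page, this loops until the service reports no more pages, so all 111 results come back regardless of how the website chunks them.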

Tom Morris
  • I only need to crawl this page once a day, to update my database. Any code sample or example query to extract all the data from this URL would be good: http://www.freebase.com/authority/imdb/title?ns&lang=en&filter=%2Ftype%2Fnamespace%2Fkeys&timestamp=2013-11-20&timestamp=2013-11-21 – user3085049 Dec 11 '13 at 01:48
  • By data I mean all MID's from webpage – user3085049 Dec 11 '13 at 01:51