
I've created this function that scrapes Technorati for blog posts and the URLs to those posts. By the way, I tortured myself trying to find an API for this and couldn't find one. I do feel a bit ashamed of this scraper, but there really should be an API! Anyway...

function get_technorati_bposts($kwd) {
    // permission check: only verified, non-suspended users may run the scraper
    global $user, $settings;
    if (!$user || $user->verified != 1 || $user->suspend != 0) {
        echo "no permission";
        return;
    }
    $user_id = $user->id;

    $items_max = $settings['scraper_num_technorati'];
    $i = 0;           // posts collected so far
    $p = 1;           // current results page
    $posts = array();

    while ($i < $items_max) {
        $url = "http://technorati.com/search?q=" . urlencode($kwd) . "&return=posts&sort=relevance&topic=overall&source=blogs&authority=high&page=" . $p;
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
        curl_setopt($ch, CURLOPT_MAXREDIRS, 3);
        curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13");
        curl_setopt($ch, CURLOPT_HEADER, FALSE);
        curl_setopt($ch, CURLOPT_TIMEOUT, 20);
        // CURLOPT_HTTPHEADER expects an array of header lines, not a string
        curl_setopt($ch, CURLOPT_HTTPHEADER, array("Content-Type: text/xml; charset=utf-8"));
        $output = curl_exec($ch);
        curl_close($ch);

        $html = "";
        $html = str_get_html($output);

        foreach ($html->find(".search-results li") as $key => $elm) {
            foreach ($elm->find(".offsite") as $link) {
                // skip the post if we already collected one from the same domain
                $href = $link->href;
                $parse = parse_url($href);
                $domain = $parse['host'];
                $match = 0;
                foreach ($posts as $item) {
                    $href_b = $item['Url'];
                    $parse_b = parse_url($href_b);
                    $domain_b = $parse_b['host'];
                    if ($domain == $domain_b) {$match++;}
                }
                if ($match > 0) {continue;}

                $posts[$i]['Url'] = $href;
                $posts[$i]['Thumb'] = "http://api.snapito.com/web/" . $settings['scraper_snapito_key'] . "/sc/" . $href . "?fast";
                $posts[$i]['Title'] = $link->title;

                $i++;
            }
            if ($items_max == $i) {break;}
        }

        $p++;
    }
    echo json_encode($posts);
}

The problem is that from time to time I get an internal server error 500.

And the log files say this:

PHP Fatal error: Call to a member function find() on a non-object in /Library/WebServer/Documents/words/lib/scraper-functions.php on line 432

Is this because cURL times out? Is there anything I can do to avoid this? For example, if cURL doesn't return anything, could I call the function again so that I get content at some point?
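
Something like this retry wrapper is what I have in mind, as a rough untested sketch (fetch_with_retry and the attempt/delay values are just placeholders I made up):

function fetch_with_retry($url, $max_attempts = 3) {
    for ($attempt = 1; $attempt <= $max_attempts; $attempt++) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
        curl_setopt($ch, CURLOPT_TIMEOUT, 20);
        $output = curl_exec($ch);
        curl_close($ch);
        // curl_exec() returns FALSE on failure (e.g. a timeout),
        // so only a non-false result is safe to pass to str_get_html()
        if ($output !== FALSE) {
            return $output;
        }
        sleep(1); // short pause before the next attempt
    }
    return FALSE; // the caller still has to handle total failure
}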

    If you want to scrape in PHP, consider using Goutte, which wraps cURL nicely. Or if you just want to scrape, and don't actually need to do it in code, look at the free desktop software offered by import.io. – halfer Nov 28 '13 at 20:58
  • @halfer that lib looks fantastic! –  Nov 28 '13 at 21:34
  • Goutte? Yes indeed; I'm working on a largish scraping project atm, and it's nothing short of brilliant. Since it uses Guzzle internally, you get all the Symfony2 Event goodness too. – halfer Nov 28 '13 at 21:37
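
For anyone curious, the Goutte approach halfer suggests might look roughly like this (an untested sketch; it assumes Goutte is installed via Composer, and the CSS selectors are carried over from the question, not from Goutte's docs):

use Goutte\Client;

require 'vendor/autoload.php';

$client = new Client();
$crawler = $client->request('GET', 'http://technorati.com/search?q=' . urlencode($kwd));

// filter() takes CSS selectors; iterating the result yields DOMElement nodes
foreach ($crawler->filter('.search-results li .offsite') as $node) {
    echo $node->getAttribute('href'), ' => ', $node->getAttribute('title'), "\n";
}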

1 Answer


(general advice) Always check your return values for errors:

$output = curl_exec($ch);
if($output === FALSE) {
    // when output is false it can't be used in str_get_html()
    // output a proper error message in such cases
    die(curl_error($ch));
}

... and always read the manual if a function fails. :) There is a section called 'Return Values' for every function.
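
The same applies one line further down: str_get_html() from simple_html_dom also returns FALSE when it gets an empty or unusable string, and that is exactly where your "find() on a non-object" fatal error comes from. A guard there (a sketch; adjust the error handling to your needs) would stop the 500:

$html = str_get_html($output);
if ($html === FALSE) {
    // an empty or malformed response can't be parsed;
    // bail out (or skip this page) instead of calling find() on FALSE
    die("could not parse the response for: " . $url);
}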


Btw, why do you initialize $html = ""; as an empty string if you assign it again on the next line?

hek2mgl