I've written this function that scrapes Technorati for blog posts and the URLs to those posts. By the way, I searched hard for an API for this and couldn't find one. I'm not proud of resorting to a scraper, but there really should be an API! Anyway...
function get_technorati_bposts($kwd) {
    global $user, $settings;

    // Only verified, non-suspended users may use the scraper.
    if (!$user || $user->verified != 1 || $user->suspend != 0) {
        echo "no permission";
        return;
    }

    $items_max = $settings['scraper_num_technorati'];
    $i = 0;
    $p = 1;
    $posts = array();

    while ($i < $items_max) {
        $url = "http://technorati.com/search?q=" . urlencode($kwd) . "&return=posts&sort=relevance&topic=overall&source=blogs&authority=high&page=" . $p;

        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
        curl_setopt($ch, CURLOPT_MAXREDIRS, 3);
        curl_setopt($ch, CURLOPT_USERAGENT, "Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.13) Gecko/20080311 Firefox/2.0.0.13");
        curl_setopt($ch, CURLOPT_HEADER, FALSE);
        curl_setopt($ch, CURLOPT_TIMEOUT, 20);
        // CURLOPT_HTTPHEADER expects an array of header strings, not a bare string.
        curl_setopt($ch, CURLOPT_HTTPHEADER, array("Content-Type: text/xml; charset=utf-8"));
        $output = curl_exec($ch);
        curl_close($ch);

        $html = str_get_html($output);
        // This is line 432 from the error log below.
        foreach ($html->find(".search-results li") as $key => $elm) {
            foreach ($elm->find(".offsite") as $link) {
                // Skip this result if we already collected a post from the same domain.
                $href = $link->href;
                $parse = parse_url($href);
                $domain = $parse['host'];
                $match = 0;
                foreach ($posts as $item) {
                    $parse_b = parse_url($item['Url']);
                    if ($domain == $parse_b['host']) {
                        $match++;
                    }
                }
                if ($match > 0) {
                    continue;
                }

                $posts[$i]['Url'] = $href;
                $posts[$i]['Thumb'] = "http://api.snapito.com/web/" . $settings['scraper_snapito_key'] . "/sc/" . $href . "?fast";
                $posts[$i]['Title'] = $link->title;

                $i++;
            }
            if ($items_max == $i) {
                break;
            }
        }

        $p++;
    }
    echo json_encode($posts);
}
The problem is that from time to time I get an internal server error 500, and the log says:
PHP Fatal error: Call to a member function find() on a non-object in /Library/WebServer/Documents/words/lib/scraper-functions.php on line 432
Is this because cURL times out? Is there anything I can do to avoid it? For example, if cURL doesn't return anything, could I call the function again so I eventually get content?
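Something like this retry wrapper is what I have in mind, though I'm not sure it's the right approach. This is just an untested sketch, and fetch_with_retries is a name I made up:

// Untested sketch: retry the cURL fetch a few times and return FALSE cleanly
// instead of handing an empty string to str_get_html().
function fetch_with_retries($url, $max_tries = 3) {
    for ($try = 1; $try <= $max_tries; $try++) {
        $ch = curl_init();
        curl_setopt($ch, CURLOPT_URL, $url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, TRUE);
        curl_setopt($ch, CURLOPT_FOLLOWLOCATION, TRUE);
        curl_setopt($ch, CURLOPT_MAXREDIRS, 3);
        curl_setopt($ch, CURLOPT_TIMEOUT, 20);
        $output = curl_exec($ch);
        curl_close($ch);

        if ($output !== FALSE && $output !== "") {
            return $output; // got a body, safe to hand to the parser
        }
        sleep(1); // brief pause before retrying
    }
    return FALSE; // all tries failed; caller must check before str_get_html()
}

// Then, inside the while loop of get_technorati_bposts(), something like:
// $output = fetch_with_retries($url);
// if ($output === FALSE) { break; }   // give up on this keyword
// $html = str_get_html($output);
// if (!$html) { $p++; continue; }     // skip an unparseable page

Would guarding both the cURL result and the str_get_html() return value like this be enough, or is there a better pattern?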