I am building a simple web spider using built-in PHP cURL multi. It works great. Here is the basic implementation:
<?php
$remainingTargets = ...;
$concurrency = 30;
$multiHandle = curl_multi_init();
$targets = [];
while (count($targets) < $concurrency && count($remainingTargets) > 0) {
    $target = array_shift($remainingTargets);
    $alreadyChecked = ...;
    if ($alreadyChecked !== false) {
        continue;
    }
    $curl = curl_init($target);
    curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36');
    curl_setopt($curl, CURLOPT_FAILONERROR, true);
    curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 4);
    curl_setopt($curl, CURLOPT_TIMEOUT, 5);
    curl_multi_add_handle($multiHandle, $curl);
    $targets[$target] = $curl;
}

// Run loop for downloading; curl_multi_select() blocks until there is
// socket activity, so we don't burn CPU spinning on curl_multi_exec()
$running = null;
do {
    $status = curl_multi_exec($multiHandle, $running);
    if ($running) {
        curl_multi_select($multiHandle);
    }
} while ($running && $status === CURLM_OK);

// Harvest results
foreach ($targets as $target => $curl) {
    $html = curl_multi_getcontent($curl);
    curl_multi_remove_handle($multiHandle, $curl);
    curl_close($curl); // release the easy handle
    // Process this page
}
curl_multi_close($multiHandle);
// If done show results, or continue processing queue...
But I want to know: is it possible to do the harvesting inside the run loop itself? I imagine that would free up resources sooner and perform better. It seems like I want a C-style select, but curl_multi_select does not tell me which specific resource is ready.
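From the manual it looks like curl_multi_info_read might be what I'm after: after each curl_multi_exec pass it pops a message per completed transfer, including the easy handle that finished, so the run loop could harvest pages as they arrive. Here is a rough, untested sketch of what I have in mind, continuing from the setup above (processPage() is just a placeholder for my per-page handling):

```php
<?php
// Run loop that harvests each transfer as soon as it completes,
// instead of waiting for the whole batch to finish.
$running = null;
do {
    curl_multi_exec($multiHandle, $running);
    curl_multi_select($multiHandle); // block until there is activity

    // Drain the message queue: one CURLMSG_DONE per finished handle
    while (($info = curl_multi_info_read($multiHandle)) !== false) {
        if ($info['msg'] === CURLMSG_DONE) {
            $curl = $info['handle']; // the specific easy handle that finished
            $html = curl_multi_getcontent($curl);
            $url  = curl_getinfo($curl, CURLINFO_EFFECTIVE_URL);
            processPage($url, $html); // placeholder: my page processing
            curl_multi_remove_handle($multiHandle, $curl);
            curl_close($curl);
            // Could also top up the pool from $remainingTargets here
            // to keep $concurrency transfers in flight.
        }
    }
} while ($running);
```

Is this the right pattern, or is there a better way to pair completed content with its originating request?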