I am building a simple web spider using PHP's built-in cURL multi interface. It works well. Here is the basic implementation:

<?php
$remainingTargets = ...;
$concurrency = 30;

$multiHandle = curl_multi_init();
$targets = [];
while (count($targets) < $concurrency && count($remainingTargets) > 0) {
  $target = array_shift($remainingTargets);
  $alreadyChecked = ...;
  if ($alreadyChecked !== false) {
    continue;
  }
  $curl = curl_init($target);
  curl_setopt($curl, CURLOPT_USERAGENT, 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.87 Safari/537.36');
  curl_setopt($curl, CURLOPT_FAILONERROR, true);
  curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
  curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 4);
  curl_setopt($curl, CURLOPT_TIMEOUT, 5);
  curl_multi_add_handle($multiHandle, $curl);
  $targets[$target] = $curl;
}

// Run loop for downloading
$running = null;
do {
  curl_multi_exec($multiHandle, $running);
} while ($running);

// Harvest results
foreach ($targets as $target => $curl) {
  $html = curl_multi_getcontent($curl);
  curl_multi_remove_handle($multiHandle, $curl);
  // Process this page
}
curl_multi_close($multiHandle);

// If done show results, or continue processing queue...

Is it possible to harvest the results inside the run loop, as each transfer completes, instead of waiting for all of them? I imagine that would free resources sooner and perform better. It seems I want something like a C-style select(), but curl_multi_select does not tell me which specific handle has finished.

  • If you do the harvesting inside the loop you lose the benefit of `curl_multi`, because you'll wait for each request to finish before starting the next one. – Barmar Jan 30 '23 at 16:44
  • Ideally I want to send all the requests and then process each response as it is received. I think I am currently sending all the requests and processing only after every response has arrived. – William Entriken Feb 01 '23 at 18:08
  • See [`curl_multi_select()`](https://www.php.net/manual/en/function.curl-multi-select.php) to wait for a response from any of the connections. – Barmar Feb 01 '23 at 18:16
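
One way the suggestion in the comments might be put into practice is to pair `curl_multi_select()` with `curl_multi_info_read()`, which reports each finished transfer along with its handle. The sketch below is a minimal illustration, not a tested crawler: `crawlAsCompleted()` and the `$onComplete` callback are hypothetical names, the "already checked" filter from the question is omitted, and storing the target URL with `CURLOPT_PRIVATE` is just one possible way to map a finished handle back to its target.

```php
<?php
// Hypothetical helper: processes each transfer as soon as it finishes and
// tops up the pool from the queue. Returns the number of completed transfers.
function crawlAsCompleted(array $queue, int $concurrency, callable $onComplete): int
{
    $multiHandle = curl_multi_init();
    $inFlight = 0;
    $processed = 0;

    // Add handles until the pool is full or the queue is empty.
    $fill = function () use (&$queue, &$inFlight, $concurrency, $multiHandle): void {
        while ($inFlight < $concurrency && count($queue) > 0) {
            $target = array_shift($queue);
            $curl = curl_init($target);
            curl_setopt($curl, CURLOPT_FAILONERROR, true);
            curl_setopt($curl, CURLOPT_RETURNTRANSFER, true);
            curl_setopt($curl, CURLOPT_CONNECTTIMEOUT, 4);
            curl_setopt($curl, CURLOPT_TIMEOUT, 5);
            // Remember the target on the handle so it can be recovered later.
            curl_setopt($curl, CURLOPT_PRIVATE, $target);
            curl_multi_add_handle($multiHandle, $curl);
            $inFlight++;
        }
    };

    $fill();
    while ($inFlight > 0) {
        curl_multi_exec($multiHandle, $running);

        // Each message names one finished handle and its cURL result code.
        while (($info = curl_multi_info_read($multiHandle)) !== false) {
            $curl = $info['handle'];
            $target = curl_getinfo($curl, CURLINFO_PRIVATE);
            $html = $info['result'] === CURLE_OK
                ? curl_multi_getcontent($curl)
                : null; // transfer failed; see $info['result']
            curl_multi_remove_handle($multiHandle, $curl);
            $inFlight--;
            $processed++;
            $onComplete($target, $html, $info['result']);
        }

        $fill();
        if ($inFlight > 0) {
            // Sleep until some transfer has activity (at most 0.5 s)
            // instead of spinning on curl_multi_exec().
            curl_multi_select($multiHandle, 0.5);
        }
    }

    curl_multi_close($multiHandle);
    return $processed;
}
```

A call could look like `crawlAsCompleted($remainingTargets, 30, function ($target, $html, $errno) { /* process this page */ });`. Because finished handles are removed and replaced immediately, the pool stays near the concurrency limit rather than draining to zero between batches.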
