I'm trying to load around 30,000 URLs in PHP. To complete this task as quickly as possible I'm trying to use curl_multi_init(). However, it appears to be loading all 30,000 at once, whereas my understanding was that it would process 10 at a time unless otherwise specified by CURLMOPT_MAXCONNECTS.

I believe it's trying to load all 30,000 at once because the code runs for about 8 seconds (the timeout set below) and then returns empty content for most of the URLs, as if the requests failed.

The code runs as expected for a smaller number of domains, e.g. under 100.

How can I ensure it only processes 10 requests at a time?

    $mh = curl_multi_init();

    $requests = [];
    foreach ($urls as $i => $url) {
        $requests[$i] = curl_init($url);
        curl_setopt($requests[$i], CURLOPT_RETURNTRANSFER, true);
        curl_setopt($requests[$i], CURLOPT_TIMEOUT, 8);
        curl_setopt($requests[$i], CURLOPT_CONNECTTIMEOUT, 5);
        curl_setopt($requests[$i], CURLOPT_SSL_VERIFYHOST, false);
        curl_setopt($requests[$i], CURLOPT_SSL_VERIFYPEER, false);
        curl_setopt($requests[$i], CURLOPT_HEADER, false);
        curl_setopt($requests[$i], CURLOPT_FOLLOWLOCATION, TRUE);
        curl_setopt($requests[$i], CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36');
        curl_multi_add_handle($mh, $requests[$i]);
    }
    
    $active = null;
    
    do {
        curl_multi_exec($mh, $active);
    } while ($active);
    
    
    $responses = [];
    foreach ($requests as $request) {
        $responses[] = curl_multi_getcontent($request);
        curl_multi_remove_handle($mh, $request);
        curl_close($request);
    }
Mr J
  • feed batches of 100 to the operation – symcbean Jul 19 '23 at 21:24
  • If batching doesn't help, it could be that there is some kind of rate limiting on the server you're requesting from. A useful next step might be to check the HTTP response status codes, e.g. a 429 may signal a need to pause and retry later. – SimonMayer Jul 19 '23 at 21:41
  • CURLMOPT_MAXCONNECTS is actually a limit on the cache (an out-of-memory guard). By default the cache grows indefinitely as new handles are added; once the number of open curl handles reaches the limit, the oldest gets closed immediately, so with that option set to 10 you basically discard the first 29,990 handles. I believe you are looking for CURLMOPT_MAX_TOTAL_CONNECTIONS, which limits the number of simultaneously active connections and is unlimited by default. Btw, some servers have scraping protection, so when you make more requests than their limit in a given time you will get only empty responses. – Kazz Jul 19 '23 at 21:51
  • You're making the unwarranted assumption that all your cURL requests will work every time. You shouldn't make that assumption for a single request, let alone 30,000 of them. Add some checking to the requests to verify that the cURL request has succeeded and that the return status is 200 (or some other valid value). Then you'll be in a better position to determine the problem, rather than asking us to guess at it. – Tangentially Perpendicular Jul 20 '23 at 01:50
  • @kazz thanks, yes, you are correct. CURLMOPT_MAX_TOTAL_CONNECTIONS is the setting I was after. However, I've ended up using the solution below from SlamJamminton, because with CURLMOPT_MAX_TOTAL_CONNECTIONS the CURLOPT_TIMEOUT includes the time a request spends in the curl_multi queue waiting for its connection to start. So on a list of 30k most were timing out before they'd even started. – Mr J Jul 20 '23 at 15:07

1 Answer

Give this a try. It splits $urls into 100-element arrays and sends a multi request for each group of 100.

    // Collect responses across all batches (declared once, outside the loop).
    $responses = [];

    // Split the URL list into batches of 100 and run one multi request per batch.
    $chunks = array_chunk($urls, 100);
    foreach ($chunks as $chunk) {
        $mh = curl_multi_init();
        $requests = [];
        // Iterate over the current chunk, not the full $urls list.
        foreach ($chunk as $i => $url) {
            $requests[$i] = curl_init($url);
            curl_setopt($requests[$i], CURLOPT_RETURNTRANSFER, true);
            curl_setopt($requests[$i], CURLOPT_TIMEOUT, 8);
            curl_setopt($requests[$i], CURLOPT_CONNECTTIMEOUT, 5);
            curl_setopt($requests[$i], CURLOPT_SSL_VERIFYHOST, false);
            curl_setopt($requests[$i], CURLOPT_SSL_VERIFYPEER, false);
            curl_setopt($requests[$i], CURLOPT_HEADER, false);
            curl_setopt($requests[$i], CURLOPT_FOLLOWLOCATION, true);
            curl_setopt($requests[$i], CURLOPT_USERAGENT, 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36');
            curl_multi_add_handle($mh, $requests[$i]);
        }

        $active = null;

        do {
            $status = curl_multi_exec($mh, $active);
            if ($active) {
                // Wait for socket activity instead of spinning the CPU.
                curl_multi_select($mh, 1.0);
            }
        } while ($active && $status === CURLM_OK);

        // Append this batch's responses in the same order as the input URLs.
        foreach ($requests as $request) {
            $responses[] = curl_multi_getcontent($request);
            curl_multi_remove_handle($mh, $request);
            curl_close($request);
        }
        curl_multi_close($mh);
    }
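
One limitation of fixed batches is that each batch waits for its slowest request before the next batch starts. As an alternative technique (not what the code above does), a rolling window keeps a fixed number of transfers in flight and adds a new handle as soon as one finishes, so no handle sits in the multi handle waiting for a slot while its CURLOPT_TIMEOUT clock runs. The sketch below is a minimal illustration under assumptions: the helper names `makeHandle` and `fetchRolling` are hypothetical, the window of 10 comes from the question, only some of the question's options are set, and failed transfers are simply stored as null.

```php
<?php
// Hypothetical helper: build one easy handle and tag it with its array key.
// Add the remaining options from the question (user agent, SSL flags) as needed.
function makeHandle(string $url, $key)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 8);
    curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 5);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true);
    curl_setopt($ch, CURLOPT_PRIVATE, (string) $key);
    return $ch;
}

// Hypothetical helper: keep at most $window transfers in flight at any time.
function fetchRolling(array $urls, int $window = 10): array
{
    $window = max(1, $window);
    $mh = curl_multi_init();
    $keys = array_keys($urls);
    $total = count($keys);
    $next = 0;        // index of the next URL that has not been started yet
    $handles = [];    // handles currently in flight, indexed by URL key
    $responses = [];  // body (or null on failure), indexed by URL key

    do {
        // Top up the window with URLs that have not been started.
        while (count($handles) < $window && $next < $total) {
            $key = $keys[$next++];
            $handles[$key] = makeHandle($urls[$key], $key);
            curl_multi_add_handle($mh, $handles[$key]);
        }

        $status = curl_multi_exec($mh, $active);
        if ($status !== CURLM_OK) {
            break; // give up on an internal multi error
        }
        // Block until there is activity; back off briefly if select fails.
        if ($active && curl_multi_select($mh, 1.0) === -1) {
            usleep(1000);
        }

        // Harvest finished transfers, freeing slots for the next iteration.
        while ($info = curl_multi_info_read($mh)) {
            $ch = $info['handle'];
            $key = curl_getinfo($ch, CURLINFO_PRIVATE);
            $responses[$key] = $info['result'] === CURLE_OK
                ? curl_multi_getcontent($ch)
                : null;
            curl_multi_remove_handle($mh, $ch);
            unset($handles[$key]);
        }
    } while (count($handles) > 0 || $next < $total);

    curl_multi_close($mh);
    return $responses;
}
```

With this sketch, `$responses = fetchRolling($urls, 10);` returns bodies indexed by the original array keys, in completion order rather than input order.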