
So I am connecting to the https://genderize.io/ API. I want to scrape from this API as fast as possible because I might need to do 1,000,000 searches at a time. Is it possible to create 100,000 different curl handles (10 names per request) with different parameters and then execute them all in parallel? It seems too good to be true if I could. Also, if I can't do this, how else can I speed up the requests? My current code uses a single curl_init handle and changes the URL on each cycle of a for loop. Here is my current loop:

$ch3 = curl_init();
for ($x = 0; $x < $loopnumber; $x += 10) {
    // Build the batch URL with ten name[i] parameters per request
    $url = 'https://api.genderize.io?name[0]=' . $firstnames[$x];
    for ($i = 1; $i < 10; $i++) {
        $url .= '&name[' . $i . ']=' . $firstnames[$x + $i];
    }
    curl_setopt_array($ch3, array(
        CURLOPT_RETURNTRANSFER => 1,
        CURLOPT_URL => $url
    ));
    $resp3 = curl_exec($ch3);
    echo $resp3;
    $genderresponse = json_decode($resp3, true);
}
curl_close($ch3);
EdTheSped
  • "The API is free, but limited at 1000 names/day." "The API is limited to a maximum of 10 names per request" –  May 04 '16 at 20:51
  • `curl_multi_*` may help, but my guess is that the real bottleneck would be with the API. If you hit them with hundreds of thousands of calls at once, it may not save you much time. You'll have to test it yourself to see. – WillardSolutions May 04 '16 at 20:53
  • @Dagon I see that, I am going to pay for the service and get more than 1000 names per day. I was wondering if I requested in parallel if each request can have 10 names and be allowed to execute at the same time. – EdTheSped May 04 '16 at 20:54
  • 1
    you could test it and find out. –  May 04 '16 at 20:55

1 Answer


TL;DR

Yes, it is possible - in theory. But no, it won't work in practice. You'd better stay within a few hundred parallel connections.

The longer story

You will probably run out of sockets and possibly memory before you can create one million easy handles and add them to a libcurl multi handle.

If you intend to communicate with the same single remote IP and port number, and you only have one local IP address, then, since each connection needs its own local port number, you can't get more than 64K connections in parallel even in theory. You won't even get to 64K on most default-configured operating systems. (You can do more if you speak to more remote IPs or have more local IPs to bind the connections to.)

For the sake of this argument, if we assume you actually get up to 60K simultaneous connections, you'll find that the curl_multi_* API slows to a crawl with that many connections, as it is select/poll based. libcurl itself has an event-based API that is the recommended one once you go beyond perhaps a few hundred parallel connections, but from within PHP you have no way to access or use it.
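Within those limits, the curl_multi_* API can still keep a few hundred transfers in flight at once. The sketch below is one way to do that, assuming the ten-name batch URLs have already been built into an array; the function name `fetch_all`, the default cap of 100 connections, and the timeout values are illustrative choices, not part of the answer above.

```php
<?php
// Run a list of URLs through a bounded curl_multi pool.
// $urls: pre-built request URLs; $max_parallel: concurrency cap.
function fetch_all(array $urls, $max_parallel = 100) {
    $mh = curl_multi_init();
    $queue = $urls;
    $results = array();
    $in_flight = 0;

    // Create an easy handle for $url and add it to the multi handle.
    $add = function ($url) use ($mh, &$in_flight) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
        curl_multi_add_handle($mh, $ch);
        $in_flight++;
    };

    // Prime the pool up to the concurrency cap.
    while ($in_flight < $max_parallel && $queue) {
        $add(array_shift($queue));
    }

    while ($in_flight > 0) {
        curl_multi_exec($mh, $running);
        // Wait for socket activity instead of busy-looping; back off
        // briefly if select() is unavailable and returns -1.
        if (curl_multi_select($mh, 1.0) === -1) {
            usleep(1000);
        }

        // Harvest finished transfers and refill slots from the queue.
        while ($info = curl_multi_info_read($mh)) {
            $ch = $info['handle'];
            $results[] = curl_multi_getcontent($ch);
            curl_multi_remove_handle($mh, $ch);
            curl_close($ch);
            $in_flight--;
            if ($queue) {
                $add(array_shift($queue));
            }
        }
    }

    curl_multi_close($mh);
    return $results;
}
```

Each finished handle frees a slot that is immediately refilled from the queue, so the pool never exceeds $max_parallel open connections, staying well clear of the port-exhaustion and select/poll scaling limits described above.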

Daniel Stenberg