
I am using PHP and cURL to scrape the HTML of a single website's pages. Through experimentation I have discovered that my code only works when I specify 10 URLs or fewer in the $nodes array (see the code sample). I need to scrape around 100 pages at once and save the source code to a file. Can this be accomplished using one of cURL's built-in functions?

Here is the code I am using at the moment:

function getHTML(){

$nodes = array(

'http://www.example.com/page1.html',
'http://www.example.com/page2.html',
'http://www.example.com/page3.html',
'http://www.example.com/page4.html',
'http://www.example.com/page5.html',
'http://www.example.com/page6.html',
'http://www.example.com/page7.html',
'http://www.example.com/page8.html',
'http://www.example.com/page9.html',
'http://www.example.com/page10.html',
'http://www.example.com/page11.html',
'http://www.example.com/page12.html',
'http://www.example.com/page13.html',
'http://www.example.com/page14.html',
'http://www.example.com/page15.html',
'http://www.example.com/page16.html',
'http://www.example.com/page17.html',
'http://www.example.com/page18.html',
'http://www.example.com/page19.html',
'http://www.example.com/page20.html', // ...and so on...

);


$node_count = count($nodes);

$curl_arr = array();
$master = curl_multi_init();

// Create one cURL handle per URL and register it with the multi handle
for($i = 0; $i < $node_count; $i++)
{
    $url = $nodes[$i];
    $curl_arr[$i] = curl_init($url);
    curl_setopt($curl_arr[$i], CURLOPT_RETURNTRANSFER, true);
    curl_multi_add_handle($master, $curl_arr[$i]);
}

// Run all transfers until none remain in progress
do {
    curl_multi_exec($master, $running);
} while($running > 0);

echo "results: ";
for($i = 0; $i < $node_count; $i++)
{
    $results = curl_multi_getcontent  ( $curl_arr[$i]  );
    echo( $i . "\n" . $results . "\n");
echo 'done';

file_put_contents('SCRAPEDHTML.txt',$results, FILE_APPEND);

}
}

Thanks in advance

Raj Gundu

3 Answers


Slice the array into chunks of 10, then run the curl_multi loop once per chunk; a fuller sketch follows below.

$perRequest = 10;
for($i = 0; $i < count($nodes); $i += $perRequest)
{
    // Take the next batch of up to $perRequest URLs
    $currentNodes = array_slice($nodes, $i, $perRequest);

    // Normal curl_multi code using $currentNodes
}
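
As a rough, untested sketch, here is how that chunking might be folded into the original getHTML() function; the inner loop is the same curl_multi code from the question, with handle cleanup added between batches:

function getHTML(){

    $nodes = array( /* ...the full list of URLs... */ );
    $perRequest = 10;

    for($i = 0; $i < count($nodes); $i += $perRequest)
    {
        // Work on the next batch of up to 10 URLs
        $currentNodes = array_slice($nodes, $i, $perRequest);

        $curl_arr = array();
        $master = curl_multi_init();

        foreach($currentNodes as $j => $url)
        {
            $curl_arr[$j] = curl_init($url);
            curl_setopt($curl_arr[$j], CURLOPT_RETURNTRANSFER, true);
            curl_multi_add_handle($master, $curl_arr[$j]);
        }

        // Run this batch to completion
        do {
            curl_multi_exec($master, $running);
        } while($running > 0);

        // Save the results and release the handles before the next batch
        foreach($curl_arr as $handle)
        {
            $results = curl_multi_getcontent($handle);
            file_put_contents('SCRAPEDHTML.txt', $results, FILE_APPEND);
            curl_multi_remove_handle($master, $handle);
            curl_close($handle);
        }
        curl_multi_close($master);
    }
}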
CAMason
  • Thanks for this, just trying to figure out how to implement it. Would it be possible to show this code inserted into my code example above? – Raj Gundu Oct 22 '12 at 12:14

I think the PHP execution time limit is being exceeded. You can try inserting set_time_limit(300); at the top of the PHP file that contains the getHTML() function. The value 300 means the script may run for up to 300 seconds.
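
For instance, a minimal sketch assuming the function lives in a standalone script:

<?php
// Allow up to 300 seconds of execution;
// set_time_limit(0) would remove the limit entirely.
set_time_limit(300);

function getHTML(){
    // ... the curl_multi code from the question ...
}

getHTML();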

f91kdash

I have used the Rolling Curl library in my own projects; it may be helpful here. It keeps a fixed number of requests in flight and fires a callback as each one finishes:

http://code.google.com/p/rolling-curl/source/browse/trunk/RollingCurl.php
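
From memory, basic usage looks roughly like this; treat the method names, the window_size property, and the callback signature as assumptions to verify against the linked source:

require 'RollingCurl.php';

// Callback invoked as each request completes
function request_callback($response, $info, $request) {
    file_put_contents('SCRAPEDHTML.txt', $response, FILE_APPEND);
}

$rc = new RollingCurl("request_callback");
$rc->window_size = 10; // number of simultaneous connections

foreach ($nodes as $url) {
    $rc->get($url);
}

$rc->execute();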

kwelsan