I hope someone can help me out with this. I'm currently writing a spider function in PHP that recursively crawls a website (by following the links it finds on the site's pages) up to a pre-specified depth.
So far the spider works well up to 2 levels of depth. My problem starts at 3 or more levels, especially on larger websites: I get a fatal memory error, which I suspect comes from all of the recursive multi-handle cURL processing (and from the fact that 3 levels down on some sites can mean thousands of URLs being processed).
Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 366030 bytes) in C:\xampp\htdocs\crawler.php on line 105
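(For reference, a quick way to watch the usage per recursion level would be something like the helper below; it's just a diagnostic sketch using PHP's built-in memory functions, not something that's already in the crawler.)
<?php
// Diagnostic sketch only: report how much memory is in use at a given recursion
// level. $depth and $urlCount would come from crawler()'s own parameters.
function logMemoryUsage( $depth, $urlCount ){
    printf(
        "depth %d: %d URLs queued, %.1f MB in use (peak %.1f MB)\n",
        $depth,
        $urlCount,
        memory_get_usage( true ) / 1048576,
        memory_get_peak_usage( true ) / 1048576
    );
}
?>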
My question is: what am I doing wrong (or what should I be doing) to minimize the memory consumption? (There's also a small cleanup sketch after the code showing the kind of change I've been wondering about.)
Here's what the code currently looks like, with the areas relevant to memory usage left intact and the more involved link-cleaning logic reduced to a simplified placeholder (to keep it readable). Thanks!
<?php
function crawler( $urlArray, $visitedUrlArray, $depth ){
    /* Recursion check
    --------------- */
    if( empty( $urlArray ) || ( $depth < 1 ) ){
        // Base case: hand the visited list back unchanged so the recursive
        // assignment further down never receives null
        return $visitedUrlArray;
    }
    /* Set up Multi-Handler
    -------------------- */
    $multiCURLHandler = curl_multi_init();
    $curlHandleArray = array();
    foreach( $urlArray as $url ){
        $curlHandleArray[$url] = curl_init();
        curl_setopt( $curlHandleArray[$url], CURLOPT_URL, $url );
        curl_setopt( $curlHandleArray[$url], CURLOPT_HEADER, 0 );
        curl_setopt( $curlHandleArray[$url], CURLOPT_TIMEOUT, 1000 );
        curl_setopt( $curlHandleArray[$url], CURLOPT_RETURNTRANSFER, 1 );
        curl_multi_add_handle( $multiCURLHandler, $curlHandleArray[$url] );
    }
    /* Run Multi-Exec
    -------------- */
    $running = null;
    do {
        curl_multi_exec( $multiCURLHandler, $running );
    } while ( $running > 0 );
    /* Process URL pages to find links to traverse
    ------------------------------------------- */
    foreach( $curlHandleArray as $key => $curlHandle ){
        /* Grab content from a handle and close it
        --------------------------------------- */
        $urlContent = curl_multi_getcontent( $curlHandle );
        curl_multi_remove_handle( $multiCURLHandler, $curlHandle );
        curl_close( $curlHandle );
        /* Place content in a DOMDocument for easy link processing
        ------------------------------------------------------- */
        $domDoc = new DOMDocument( '1.0' );
        $success = @$domDoc->loadHTML( $urlContent );
        /* The Array to hold all the URLs to pass recursively
        -------------------------------------------------- */
        $recursionURLsArray = array();
        /* Grab all the links from the DOMDocument and add to new URL array
        ---------------------------------------------------------------- */
        $anchors = $domDoc->getElementsByTagName( 'a' );
        foreach( $anchors as $element ){
            // ---Clean the link (full normalization logic omitted for brevity)
            $link = trim( $element->getAttribute( 'href' ) );
            // ---Skip anything empty or already visited
            if( $link === '' || in_array( $link, $visitedUrlArray ) ){
                continue;
            }
            // ---Otherwise queue it for the next level and mark it as visited
            $recursionURLsArray[] = $link;
            $visitedUrlArray[] = $link;
        }
        /* Call the function recursively with the parsed URLs
        -------------------------------------------------- */
        $visitedUrlArray = crawler( $recursionURLsArray, $visitedUrlArray, $depth - 1 );
    }
    /* Close and unset variables
    ------------------------- */
    curl_multi_close( $multiCURLHandler );
    unset( $multiCURLHandler );
    unset( $curlHandleArray );
    return $visitedUrlArray;
}
?>
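As mentioned above, the only idea I've come up with so far is to free the per-page buffers explicitly at the end of each loop iteration, on the assumption that $urlContent and $domDoc are what pile up across recursion levels. This is just a sketch of what I mean (the variable names are the ones from the loop above):
<?php
// Sketch only: drop the per-page data before moving on to the next handle,
// assuming these are the allocations that accumulate across recursion levels.
$anchors = null;
$domDoc  = null;
unset( $urlContent );
gc_collect_cycles(); // ask the collector to reclaim any circular references
?>
I'm not sure whether that's the right direction, or whether the recursion itself is the real problem, which is why I'm asking.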