
I hope someone can help me out with this. I'm currently writing a spider function in PHP that recursively crawls a website (via the links it finds on the site's pages) up to a pre-specified depth.

So far my spider works well for up to 2 levels of depth. My problem is when the depth is 3 or more levels down, especially on larger websites. I get a fatal memory error, which I'm thinking has to do with all of the recursive multi-processing using cURL (and also because 3 levels down on some sites can mean thousands of URLs that are processed).

Fatal error: Allowed memory size of 134217728 bytes exhausted (tried to allocate 366030 bytes) in C:\xampp\htdocs\crawler.php on line 105

My question is about what I might be doing wrong (or what I ought to be doing) to minimize the memory consumption.

Here's what the code currently looks like, with the important areas related to memory usage left intact, and the more complex processing sections replaced with pseudo-code/comments (to make it simpler to read). Thanks!

<?php

function crawler( $urlArray, $visitedUrlArray, $depth ){

    /* Recursion check 
       --------------- */
    if( empty( $urlArray ) || ( $depth < 1 ) ){
        return $visitedUrlArray;
    }

    /* Set up Multi-Handler 
       -------------------- */
    $multiCURLHandler = curl_multi_init();      
    $curlHandleArray = array();

    foreach( $urlArray as $url ){
        $curlHandleArray[$url] = curl_init();
        curl_setopt( $curlHandleArray[$url], CURLOPT_URL, $url );
        curl_setopt( $curlHandleArray[$url], CURLOPT_HEADER, 0 );
        curl_setopt( $curlHandleArray[$url], CURLOPT_TIMEOUT, 1000 );
        curl_setopt( $curlHandleArray[$url], CURLOPT_RETURNTRANSFER , 1 );  
        curl_multi_add_handle( $multiCURLHandler, $curlHandleArray[$url] );
    }

    /* Run Multi-Exec 
       -------------- */
    $running = null;
    do {
        curl_multi_exec( $multiCURLHandler, $running );
    }
    while ( $running > 0 );


    /* Process URL pages to find links to traverse
       ------------------------------------------- */
    foreach( $curlHandleArray as $key => $curlHandle ){


        /* Grab content from a handle and close it
           --------------------------------------- */
        $urlContent = curl_multi_getcontent( $curlHandle );
        curl_multi_remove_handle( $multiCURLHandler, $curlHandle );
        curl_close( $curlHandle );          


        /* Place content in a DOMDocument for easy link processing
           ------------------------------------------------------- */
        $domDoc = new DOMDocument( '1.0' );
        $success = @$domDoc -> loadHTML( $urlContent );


        /* The Array to hold all the URLs to pass recursively
           -------------------------------------------------- */    
        $recursionURLsArray = array();


        /* Grab all the links from the DOMDocument and add to new URL array
           ---------------------------------------------------------------- */
        $anchors = $domDoc -> getElementsByTagName( 'a' );
        foreach( $anchors as $element ){
            // ---Clean the link
            // ---Check if the link is in $visited
            //    ---If so, continue;
            //    ---If not, add to $recursionURLsArray and $visitedUrlArray
        }


        /* Call the function recursively with the parsed URLs
           -------------------------------------------------- */
        $visitedUrlArray = crawler( $recursionURLsArray, $visitedUrlArray, $depth - 1 );

    }


    /* Close and unset variables
       ------------------------- */
    curl_multi_close( $multiCURLHandler );
    unset( $multiCURLHandler );
    unset( $curlHandleArray );

    return $visitedUrlArray;
}
?>
Zero Wing

1 Answer


This is your problem:

 "I'm currently writing a spider function in PHP that recursively crawls across a website"

Don't do that. You are going to get into an infinite loop and cause a denial of service. Your real problem is not running out of memory. Your real problem is that you are going to take down the sites you are crawling.

Real webspiders do not attack your website and hit every page boom boom boom like you are doing. The way you are doing it is more like an attack than a legitimate webcrawl. They are called "crawlers" because they "crawl," as in "go slow." Plus, a legitimate webcrawler will read the robots.txt file and skip pages that are off limits according to that file.
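A bare-bones version of that robots.txt check might look something like this. It's a hypothetical helper, not something from the question's code; it only honors plain Disallow lines and ignores User-agent groups, Allow rules and wildcards, so treat it as a starting point rather than a compliant parser:

<?php

function isAllowedByRobots( $url ){

    /* Split the URL so we can find the site's robots.txt
       --------------------------------------------------- */
    $parts = parse_url( $url );
    if( $parts === false || !isset( $parts['scheme'] ) || !isset( $parts['host'] ) ){
        return false;
    }

    /* No robots.txt means nothing is explicitly off limits
       ----------------------------------------------------- */
    $robotsTxt = @file_get_contents( $parts['scheme'] . '://' . $parts['host'] . '/robots.txt' );
    if( $robotsTxt === false ){
        return true;
    }

    /* Reject the URL if its path falls under any Disallow prefix
       ----------------------------------------------------------- */
    $path = isset( $parts['path'] ) ? $parts['path'] : '/';
    foreach( explode( "\n", $robotsTxt ) as $line ){
        if( preg_match( '/^\s*Disallow:\s*(\S+)/i', $line, $match )
            && strpos( $path, $match[1] ) === 0 ){
            return false;
        }
    }
    return true;
}
?>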

You should do it more like this:

  1. Read ONE page and save its links to a database where the URL column has a UNIQUE constraint, so you don't get the same one in there more than once. This table should also have a status field to show whether the URL has been read or not.

  2. Grab a URL from the database whose status field shows it's unread. Read it and save the URLs it links to into the database. Update its status field to show that it's been read.

Repeat step 2 as needed, but at the pace of a crawl; a rough sketch of that loop is below.
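Here is a rough sketch of that loop, assuming a SQLite database accessed through PDO. The table name (crawl_queue), the column names (url, status), the seed URL and the 30-second pause are all placeholders, not anything from the original question:

<?php

/* One table, one UNIQUE url column, one status column
   ---------------------------------------------------- */
$db = new PDO( 'sqlite:crawler.db' );
$db->setAttribute( PDO::ATTR_ERRMODE, PDO::ERRMODE_EXCEPTION );
$db->exec( "CREATE TABLE IF NOT EXISTS crawl_queue (
                url    TEXT PRIMARY KEY,
                status TEXT NOT NULL DEFAULT 'unread'
            )" );

/* Step 1: seed the queue (duplicates are silently ignored)
   --------------------------------------------------------- */
$insertUrl = $db->prepare( 'INSERT OR IGNORE INTO crawl_queue (url) VALUES (?)' );
$insertUrl->execute( array( 'http://example.com/' ) );

/* Step 2, repeated: read ONE unread URL at a time
   ------------------------------------------------ */
while( true ){

    $row = $db->query( "SELECT url FROM crawl_queue WHERE status = 'unread' LIMIT 1" )
              ->fetch( PDO::FETCH_ASSOC );
    if( $row === false ){
        break;                            // nothing left to read
    }
    $url = $row['url'];

    /* Fetch the page with a single cURL handle */
    $curlHandle = curl_init( $url );
    curl_setopt( $curlHandle, CURLOPT_RETURNTRANSFER, 1 );
    curl_setopt( $curlHandle, CURLOPT_TIMEOUT, 30 );
    $urlContent = curl_exec( $curlHandle );
    curl_close( $curlHandle );

    /* Save the links it contains; the UNIQUE constraint keeps them deduplicated */
    if( $urlContent !== false ){
        $domDoc = new DOMDocument( '1.0' );
        @$domDoc->loadHTML( $urlContent );
        foreach( $domDoc->getElementsByTagName( 'a' ) as $anchor ){
            $link = $anchor->getAttribute( 'href' );
            // ---Clean/absolutize the link here, as in the original crawler()
            if( $link !== '' ){
                $insertUrl->execute( array( $link ) );
            }
        }
    }

    /* Mark the URL as read so it is never fetched again */
    $db->prepare( "UPDATE crawl_queue SET status = 'read' WHERE url = ?" )
       ->execute( array( $url ) );

    /* Crawl, don't hammer: pause between requests (see the interval figures below) */
    sleep( 30 );
}
?>

Because the frontier lives in the database instead of in recursive function calls, only one page's content is in memory at a time, and you can stop and resume the crawl whenever you like.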

From http://en.wikipedia.org/wiki/Web_crawler#Politeness_policy :

Anecdotal evidence from access logs shows that access intervals from known crawlers vary between 20 seconds and 3–4 minutes.

developerwjk