
I have a web bot, and it consumes a lot of memory: after a while, memory usage hits 50% and the process gets killed. I have no idea why memory usage keeps growing like that. I have not included "para.php", which is a library for parallel cURL requests. I want to learn more about web crawlers; I have searched a lot, but could not find any helpful documentation or methods I can use.

This is the library from which I obtained para.php.

My code:

require_once "para.php";

class crawling{

public $montent;


public function crawl_page($url){

    $m = new Mongo();

    $muun = $m->howto->en->findOne(array("_id" => $url));

    if (isset($muun)) {
        return;
    }

    $m->howto->en->save(array("_id" => $url));

    echo $url;

    echo "\n";

    $para = new ParallelCurl(10);

    $para->startRequest($url, array($this,'on_request_done'));

    $para->finishAllRequests();

    preg_match_all("(<a href=\"(.*)\")siU", $this->montent, $matk);

    foreach($matk[1] as $longu){
        $href = $longu;
        if (0 !== strpos($href, 'http')) {
            $path = '/' . ltrim($href, '/');
            if (extension_loaded('http')) {
                $href = http_build_url($url, array('path' => $path));
            } else {
                $parts = parse_url($url);
                $href = $parts['scheme'] . '://';
                if (isset($parts['user']) && isset($parts['pass'])) {
                    $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                }


                $href .= $parts['host'];
                if (isset($parts['port'])) {
                    $href .= ':' . $parts['port'];
                }
                $href .= $path;
            }
            }
            // Recurse into the resolved link.
            $this->crawl_page($href);
        }
    }

    public function on_request_done($content) {
        $this->montent = $content;
    }
}

$moj = new crawling;
$moj->crawl_page("http://www.example.com/");
  • What was all that harry potter business you deleted? – brbcoding May 14 '13 at 18:37
  • @brbcoding I was kind of angry because I could not meet the quality standards; after I met the standards, I removed those lines – sick of this quality stantard May 14 '13 at 18:39
  • Most likely you are getting the bane of PHP gc: redundant references to variables. – Sammaye May 14 '13 at 19:22
  • How much memory is 50%? What sort of server are you running it on (shared server, or VPS/dedicated)? How many crawlers are you running in parallel? (I presume this library lets you set how many should be instantiated). – halfer May 14 '13 at 20:01
  • (Btw, if you've chosen your nick in frustration, it'd probably be better to change it to something else, since some people may downvote just based on that). – halfer May 14 '13 at 20:02
  • @halfer It is 512 MB and I am running it on a VPS. Usage starts at 4% and increases over time; should I unset some variables? I have no idea what to do here. Only 1 crawler is running. – sick of this quality stantard May 14 '13 at 20:19
  • @halfer I should support my opinion; people may downvote me or find it weird, but I find this quality standard algorithm weird. – sick of this quality stantard May 14 '13 at 20:22
  • @Sammaye Those are the gc functions I found; do you think it will be okay? gc_enable(); // Enable Garbage Collector var_dump(gc_enabled()); // true var_dump(gc_collect_cycles()); // # of elements cleaned up gc_disable(); // Disable Garbage Collector – sick of this quality stantard May 14 '13 at 20:28
  • I believe the quality system largely has the support of the community - [read these posts](http://meta.stackoverflow.com/search?q=quality) if you get a moment, and ask a question over there about it if you like. – halfer May 14 '13 at 22:24
  • You could add some trace statements to this library, in particular using [this statement](http://php.net/manual/en/function.memory-get-peak-usage.php) to see where memory is peaking. Also, read the docs and bug tickets for that library in case there are memory directives and/or bug reports that are worth reading. Can you set a memory limit in PHP to see if that makes a difference? What is your `memory_limit` at the moment in php.ini? – halfer May 14 '13 at 22:27 (a tracing sketch along these lines follows the comment thread)
  • @halfer it is 256 MB at the moment – sick of this quality stantard May 15 '13 at 04:57
  • 512M is very small for a crawler. Remember that at least half of that will be taken by your OS. Can you bump it up to 1G? Maybe run the crawler from your dev machine for a few hours, to see what it needs? – halfer May 15 '13 at 12:43
  • @halfer I am a student and I don't have much money :)), even 5 more dollars is a lot for me, but thank you, you have helped me a lot. – sick of this quality stantard May 15 '13 at 16:41
  • @sick: it might be terrible, but you can get a 1G machine in New York for 48USD per annum, or 7USD per month - [browse offers here](http://www.lowendbox.com/). Don't use it for anything mission-critical of course! For me, I run a 512M box in the UK for less than 5GBP per month, and the service is mainly excellent. – halfer May 15 '13 at 16:43
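
Following up on the comments above about garbage collection and memory_get_peak_usage(), here is a minimal tracing sketch; the log_memory() helper and its labels are illustrative assumptions, not part of para.php or of the code in the question:

gc_enable(); // make sure the cycle collector is on
var_dump(gc_enabled()); // should print bool(true)

// Hypothetical helper: print current and peak memory usage with a label.
function log_memory($label) {
    printf(
        "%s: current %.1f MB, peak %.1f MB\n",
        $label,
        memory_get_usage(true) / 1048576,
        memory_get_peak_usage(true) / 1048576
    );
}

log_memory("before crawl");
// ... run a crawl_page() call here ...
$freed = gc_collect_cycles(); // returns the number of reference cycles collected
log_memory("after crawl ($freed cycles collected)");

Sprinkling calls like these around the crawl (and inside the library, as halfer suggests) should show where the peak is reached; if gc_collect_cycles() frees a large number of cycles, circular references are likely part of the problem.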

1 Answer


You call this crawl_page function on one URL. Its content is fetched ($this->montent) and checked for links ($matk).

While these are not yet destroyed, you go recursive, starting a new call to crawl_page. $this->montent will be overwritten with the new content (that's OK). A bit further down, $matk (a new variable) is populated with the links for the new $this->montent. At this point, there are two $matk arrays in memory: the one with all the links of the document you started processing first, and the one with all the links of the document that was first linked to from your original document. Because every level of recursion keeps its own copy alive until it returns, memory grows with the depth of the crawl.

I'd suggest finding all links and saving them to a database instead of immediately recursing. Then work through that queue of links in the database one by one, with each newly fetched document adding new entries to the queue.
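
A minimal sketch of that queue-driven approach, reusing the Mongo collection and the ParallelCurl wrapper from the question; the "crawled" flag and the closure callback are assumptions of mine, and the relative-link resolution from the original code is omitted for brevity:

require_once "para.php";

$m = new Mongo();
$queue = $m->howto->en;

// Seed the queue with the start URL; "crawled" marks whether it has been fetched yet.
$queue->save(array("_id" => "http://www.example.com/", "crawled" => false));

$para = new ParallelCurl(10);

while ($doc = $queue->findOne(array("crawled" => false))) {
    $url = $doc["_id"];

    // Mark the URL as done before fetching, so it is never picked up again.
    $queue->update(array("_id" => $url), array('$set' => array("crawled" => true)));

    $para->startRequest($url, function ($content) use ($queue) {
        // Extract links and push unseen ones onto the queue.
        preg_match_all("(<a href=\"(.*)\")siU", $content, $matk);
        foreach ($matk[1] as $href) {
            if (!$queue->findOne(array("_id" => $href))) {
                $queue->save(array("_id" => $href, "crawled" => false));
            }
        }
        // $content and $matk go out of scope here, so each page's link
        // list can be freed before the next page is processed.
    });
    $para->finishAllRequests();
}

Because the pending links live in MongoDB rather than on the PHP call stack, each iteration only holds one page's content and link list in memory, so usage should stay roughly flat instead of growing with crawl depth.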
