I have a web bot, and it consumes far too much memory; after a while, memory usage hits 50% and the process gets killed. I have no idea why memory usage keeps increasing like that. I did not include "para.php", which is a library for parallel curl requests. I want to learn more about web crawlers; I searched a lot but could not find any helpful documentation or methods I could use.
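A minimal diagnostic sketch (assuming PHP's standard memory_get_usage(); the echo line is illustrative, placed at the top of crawl_page() in the code below) confirms that the usage climbs with every page:

echo memory_get_usage(true) . " bytes before crawling " . $url . "\n";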
This is the library from which I obtained para.php.
My code:
require_once "para.php";

class crawling {
    public $montent; // holds the body of the most recently fetched page

    public function crawl_page($url) {
        // Note: every call opens a new Mongo connection.
        $m = new Mongo();

        // Skip URLs that have already been visited.
        $muun = $m->howto->en->findOne(array("_id" => $url));
        if ($muun !== null) {
            return;
        }
        $m->howto->en->save(array("_id" => $url));

        echo $url . "\n";

        // Fetch the page; on_request_done() stores the body in $this->montent.
        $para = new ParallelCurl(10);
        $para->startRequest($url, array($this, 'on_request_done'));
        $para->finishAllRequests();
        // Extract every href attribute from the fetched page.
        preg_match_all("(<a href=\"(.*)\")siU", $this->montent, $matk);

        foreach ($matk[1] as $longu) {
            $href = $longu;
            // Resolve relative URLs against the current page's URL.
            if (0 !== strpos($href, 'http')) {
                $path = '/' . ltrim($href, '/');
                if (extension_loaded('http')) {
                    $href = http_build_url($url, array('path' => $path));
                } else {
                    $parts = parse_url($url);
                    $href = $parts['scheme'] . '://';
                    if (isset($parts['user']) && isset($parts['pass'])) {
                        $href .= $parts['user'] . ':' . $parts['pass'] . '@';
                    }
                    $href .= $parts['host'];
                    if (isset($parts['port'])) {
                        $href .= ':' . $parts['port'];
                    }
                    $href .= $path;
                }
            }
            // Recurse into the resolved absolute URL.
            $this->crawl_page($href);
        }
    }
    public function on_request_done($content) {
        // ParallelCurl invokes this callback with the fetched page body.
        $this->montent = $content;
    }
}
$moj = new crawling;
$moj->crawl_page("http://www.example.com/");
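I suspect the recursion: every pending crawl_page() call keeps its locals ($m, $para, $matk) alive until its entire link subtree has been crawled, so on a deeply linked site the stack of live Mongo connections, ParallelCurl objects, and match arrays only grows. Would an iterative, queue-based version like the sketch below avoid this? It is only a sketch, not a drop-in replacement: the names crawling_iter and crawl_site() are illustrative, it assumes a ParallelCurl instance can be reused after finishAllRequests(), and it omits the relative-URL resolution shown above.

require_once "para.php";

class crawling_iter {
    public $montent;

    public function on_request_done($content) {
        $this->montent = $content;
    }

    public function crawl_site($start) {
        $m = new Mongo();             // one connection for the whole crawl
        $para = new ParallelCurl(10); // one ParallelCurl instance, reused
        $queue = array($start);
        while (count($queue) > 0) {
            $url = array_shift($queue);
            if ($m->howto->en->findOne(array("_id" => $url)) !== null) {
                continue; // already visited
            }
            $m->howto->en->save(array("_id" => $url));
            $para->startRequest($url, array($this, 'on_request_done'));
            $para->finishAllRequests();
            preg_match_all("(<a href=\"(.*)\")siU", $this->montent, $matk);
            foreach ($matk[1] as $href) {
                $queue[] = $href; // relative URLs would still need resolving
            }
            unset($matk); // release the match array before the next page
        }
    }
}

$moj = new crawling_iter();
$moj->crawl_site("http://www.example.com/");

The queue itself can still grow on a large site, but finished pages would no longer pin a Mongo connection, a ParallelCurl object, and their match arrays in memory.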