
Because crawling the web can take a lot of time, I want to use pcntl_fork() to create multiple child processes and split my code into parts:

  1. Master - crawls the domain
  2. Child - when it receives a link found on the domain, the child must crawl that link
  3. Child - must do the same as 2. whenever it receives a new link

Can I create as many children as I want, or do I have to set a maximum number of child processes?
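
Roughly the fork layout I have in mind, as a minimal sketch assuming pcntl were available (MAX_CHILDREN and crawl_link() are only placeholders for illustration, not part of my real code):

<?php
// Minimal sketch of a capped fork loop; requires the pcntl extension.
// MAX_CHILDREN and crawl_link() are placeholders for illustration.
define('MAX_CHILDREN', 5);

function crawl_link($url)
{
    // a child would crawl $url here
    echo "child " . getmypid() . " crawling $url\n";
}

$links   = array('http://example.com/a', 'http://example.com/b'); // links the master found
$running = 0;

foreach ($links as $link) {
    // when the cap is reached, wait for one child to finish before forking another
    if ($running >= MAX_CHILDREN) {
        pcntl_wait($status);
        $running--;
    }

    $pid = pcntl_fork();
    if ($pid == -1) {
        die('could not fork');
    } elseif ($pid == 0) {
        crawl_link($link); // child: crawl the link it was handed, then exit
        exit(0);
    } else {
        $running++;        // master: keep crawling the domain
    }
}

while ($running-- > 0) {
    pcntl_wait($status);   // reap the remaining children
}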

Here's my code:

class MyCrawler extends PHPCrawler
{
    function handlePageData(&$page_data)
    {
        // CHECK DOMAIN
        $domain   = $_POST['domain'];
        $keywords = $_POST['keywords'];
        //$tags = get_meta_tags($page_data["url"]);
        //$iKeyFound = null;

        $find = $keywords;
        $str  = file_get_contents($page_data["url"]);

        // only continue when the keyword occurs in the page and the content was received
        if (strpos($str, $find) !== false && $page_data["received"] == true)
        {
            // PRINT STATUS OF THE WEBSITE (first header line)
            if ($page_data["header"]) {
                echo "<table border='1' >";
                echo "<tr><td width='300'>Status:</td><td width='500'> ".strtok($page_data["header"], "\n")."</td></tr>";
            } else {
                echo "<table border='1' >";
            }

            // PRINT FIRST LINE: the requested page
            echo "<tr><td>Page requested:</td><td> ".$page_data["url"]."</td></tr>";

            // PRINT THE REFERRING PAGE
            echo "<tr><td>Referer-page:</td><td> ".$page_data["referer_url"]."</td></tr>";

            // CONTENT RECEIVED?
            if ($page_data["received"] == true)
                echo "<tr><td>Content received: </td><td>".($page_data["bytes_received"] / 1024)." Kbytes</td></tr></table>";
            else
                echo "<tr><td>Content:</td><td> Not received</td></tr></table>";

            $link = mysql_connect('localhost', 'crawler', 'DRZOIDBERGGG');
            if (!$link) {
                die('Could not connect: ' . mysql_error());
            }
            mysql_select_db("crawler");

            if (empty($page_data["referer_url"]))
                $page_data["referer_url"] = $page_data["url"];

            // strip everything except <p> and <b> before searching the content
            $str = strip_tags($str, '<p><b>');
            //$match = preg_match_all("'/<(*.?)(*.?)>(*.?)'".$keywords."'(*.?)<\/($1)>/'", $str, $matches, PREG_SET_ORDER);
            //echo $match;

            // case-insensitive search for the keyword in all text nodes
            $doc = new DOMDocument();
            @$doc->loadHTML($str);

            $xPath      = new DOMXpath($doc);
            $xPathQuery = "//text()[contains(translate(.,'abcdefghijklmnopqrstuvwxyz', 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'), '".strtoupper($keywords)."')]";
            $elements   = $xPath->query($xPathQuery);

            if ($elements->length > 0) {
                foreach ($elements as $element) {
                    print "Found: ".$element->nodeValue."<br />";
                }

                // $element now holds the last match; store it if it is not in the table yet
                $result = mysql_query("SELECT * FROM crawler WHERE data = '".mysql_real_escape_string($element->nodeValue)."'");

                if (mysql_num_rows($result) > 0) {
                    echo 'Row already exists';
                } else {
                    echo 'added';
                    mysql_query("INSERT INTO crawler (id, domain, url, keywords, data) VALUES ('', '".$page_data["referer_url"]."', '".$page_data["url"]."', '".$keywords."', '".mysql_real_escape_string($element->nodeValue)."')");
                }
            }

            echo "<br><br>";
            echo str_pad(" ", 5000); // "Force flush", workaround
            flush();
        }
    }
}

FORGOT TO SAY: I need a Windows x86 (32-bit) workaround!

Because pcntl is not supported on my client's machine.

  • Comment to all: look for the funny word in the code and win some great prizes! –  Sep 14 '10 at 07:43

3 Answers


I wonder if you wouldn't be better served by going with something like Gearman for this.

It's a job manager that runs on your system: you submit jobs to it (via PHP if you like), and it assigns them to workers (again written in PHP), which then report back with their results. It's pretty robust and flexible in that you can run more workers to handle more workload.
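
For example, a minimal sketch using the pecl/gearman extension, assuming a gearmand server is running locally (the 'crawl_url' job name and the crawling code itself are just placeholders):

<?php
// submit.php - the master hands every link it finds to Gearman (sketch only)
$client = new GearmanClient();
$client->addServer('127.0.0.1', 4730);

foreach (array('http://example.com/a', 'http://example.com/b') as $url) {
    $client->doBackground('crawl_url', $url); // returns immediately; a worker picks it up
}

And the worker side, of which you can start as many copies as you want parallel crawlers:

<?php
// worker.php - crawls the URLs it is handed (sketch only)
function crawl_url_job(GearmanJob $job)
{
    $url = $job->workload();
    // ... crawl $url and store the results ...
    return strlen(@file_get_contents($url));
}

$worker = new GearmanWorker();
$worker->addServer('127.0.0.1', 4730);
$worker->addFunction('crawl_url', 'crawl_url_job');

while ($worker->work());

Scaling up is then mostly a matter of starting worker.php a few more times.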

Fanis Hatzidakis
  • Very nice, but it's not what I'm looking for ;) +1 anyway. –  Sep 14 '10 at 07:49
  • If everything has to sit on win32 then yes, Gearman is not suitable at the moment. I'm afraid I can't help you with pcntl_fork but best of luck with it :) – Fanis Hatzidakis Sep 14 '10 at 08:21

shell_exec does the trick, but I don't know how to use it for this.
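
For what it's worth, a minimal sketch of that idea on Windows: shell_exec() waits for the command to finish, so popen()/pclose() together with "start /B" is used here to return immediately (worker.php is a hypothetical script that crawls the URL passed as $argv[1]):

<?php
// spawn one detached PHP worker per link on Windows (sketch only)
// worker.php is a placeholder script that crawls $argv[1]
$links = array('http://example.com/a', 'http://example.com/b');

foreach ($links as $url) {
    $cmd = 'start /B php worker.php ' . escapeshellarg($url);
    pclose(popen($cmd, 'r')); // returns right away instead of waiting for output
}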


Look into this: http://in.php.net/manual/en/ref.pcntl.php#37369

Phill Pafford