
I'm dealing with Godaddy auction domains; they provide a way to download the domain listings. I have a cron job that downloads the listing and dumps (inserts) it into my database table. That process takes only a few seconds from download to database dump. The total number of domains (records) in this case is 34,000 entries.

Second, I need to update the page rank for each individual domain in the database, for all 34,000 records. I have a PHP API for fetching the page rank live. The Godaddy downloads don't include page rank details, so I have to fetch and update them separately.

Now, the problem is that fetching the page rank live and then updating it in the database takes too much time for all 34,000 domains.

I recently ran an experiment via a cron job to update the page rank for domains in the database: it took 4 hours to update the page rank for just 13,383 of the 34,000 domains, since each domain must first be fetched and then updated in the database. This was all running on a dedicated server.

Is there any way to speed up this process for a large number of domains? The only way I can think of is to accomplish this via multitasking.

Would it be possible to have 100 tasks fetching page rank and updating the database simultaneously?
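One simple way to get parallelism without threads (a sketch; the `slice_domains` helper and the CLI arguments are my own, not part of your code) is to split the domain list into N slices and launch N cron/CLI processes, each handling one slice:

```php
<?php
// Split a list of domains into roughly equal slices so each cron/CLI
// process can be started with its own slice index, e.g.:
//   php update_ranks.php 0 100   (process slice 0 of 100)
function slice_domains(array $domains, $workers, $slice)
{
    $chunks = array_chunk($domains, (int) ceil(count($domains) / $workers));
    return isset($chunks[$slice]) ? $chunks[$slice] : array();
}

// Example: 10 domains split across 3 processes
$all = array('a.com','b.com','c.com','d.com','e.com',
             'f.com','g.com','h.com','i.com','j.com');
print_r(slice_domains($all, 3, 0)); // first four domains
print_r(slice_domains($all, 3, 2)); // remaining two domains
```

Each process then fetches and updates only its own slice, so 100 processes would each handle about 340 domains.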

In case you need the code:

$sql = "SELECT domain FROM auctions";
$mozi_get = runQuery($sql);

while ($results = mysql_fetch_array($mozi_get)) {
    /* PAGERANK API */
    if ($results['domain'] != 'Featured Listings') {
        try {
            $url  = new SEOstats("http://www." . trim($results['domain']));
            $rank = $url->Google_Page_Rank();
            if (!is_integer($rank)) {
                $rank = '0'; // fall back to 0 so a stale value never carries over
            }
        } catch (SEOstatsException $e) {
            $rank = '0';
        }

        try {
            $url      = new SEOstats("http://" . trim($results['domain']));
            $rank_non = $url->Google_Page_Rank();
            if (!is_integer($rank_non)) {
                $rank_non = '0';
            }
        } catch (SEOstatsException $e) {
            $rank_non = '0';
        }

        $sql = "UPDATE auctions SET rank='" . $rank . "', rank_non='" . $rank_non . "' WHERE domain='" . $results['domain'] . "'";
        runQuery($sql);
        echo $sql . "<br />";
    }
}
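Independent of parallelism, the per-domain UPDATE inside the loop costs one database round trip per domain. A batched UPDATE using `CASE` can write many ranks in one statement (a sketch; `build_rank_update` is a hypothetical helper I'm introducing, and it assumes the ranks have already been fetched into an array):

```php
<?php
// Build one UPDATE covering many domains at once.
// $ranks maps domain => array(rank, rank_non). Values are escaped/cast
// before being interpolated; on a live connection you would use
// mysql_real_escape_string() instead of addslashes().
function build_rank_update(array $ranks)
{
    if (!$ranks) return '';
    $rank = $rank_non = $in = array();
    foreach ($ranks as $domain => $r) {
        $d = addslashes($domain);
        $rank[]     = "WHEN '$d' THEN " . (int) $r[0];
        $rank_non[] = "WHEN '$d' THEN " . (int) $r[1];
        $in[]       = "'$d'";
    }
    return "UPDATE auctions SET"
         . " rank = CASE domain " . implode(' ', $rank) . " END,"
         . " rank_non = CASE domain " . implode(' ', $rank_non) . " END"
         . " WHERE domain IN (" . implode(',', $in) . ")";
}

echo build_rank_update(array('example.com' => array(4, 3)));
```

Collecting a few hundred fetched ranks and flushing them with one such statement cuts the update round trips by that same factor.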

Here is my updated code for pthreads:

<?php
set_time_limit(0);
require_once("database.php");
include 'src/class.seostats.php';


function get_page_rank($domain) {
    try {
        $url  = new SEOstats("http://www." . trim($domain));
        $rank = $url->Google_Page_Rank();
        if (!is_integer($rank)) {
            $rank = '0';
        }
    } catch (SEOstatsException $e) {
        $rank = '0';
    }

    return $rank;
}

class Ranking extends Worker {
  public function run(){}
}

class Domain extends Stackable {

  public $name;
  public $ranking;

  public function __construct($name) {

    $this->name = $name;

  }

  public function run() {

    $this->ranking = get_page_rank($this->name);

    /* now write the Domain to database or whatever */

    $sql = "UPDATE auctions set rank = '" . $this->ranking . "' WHERE domain = '" . $this->name . "'"; 
    runQuery($sql);

  }

}

/* start some workers */
$workers = array();
while (@$worker++ < 8) {
  $workers[$worker] = new Ranking();
  $workers[$worker]->start();
}

/* select auctions and start processing */

$domains = array();

$sql = "SELECT domain from auctions"; // RETURNS 55369 RECORDS

$domain_result = runQuery($sql);

while($results = mysql_fetch_array($domain_result)) {

  $domains[$results['domain']] = new Domain($results['domain']);
  $workers[array_rand($workers)]->stack($domains[$results['domain']]);

}


/* shutdown all workers (forcing all processing to finish) */
foreach ($workers as $worker)
  $worker->shutdown();

/* we now have ranked domains in memory and database */
var_dump($domains);
var_dump(count($domains));
?>

Any help will be highly appreciated. Thanks

Kappa
Irfan

1 Answer


Well, here's a pthreads example that will allow you to multi-thread your operations. I have chosen the worker model and am using 8 workers; how many workers you use depends on your hardware and on the service receiving the requests. I've never used SEOstats or godaddy domain auctions, so I'm not sure of the CSV fields and will leave the getting of page ranks to you.

<?php
define ("CSV", "https://auctions.godaddy.com/trpSearchResults.aspx?t=12&action=export");

/* I have no idea how to get the actual page rank */
function get_page_rank($domain) {
  return rand(1,10);
}

class Ranking extends Worker {
  public function run(){}
}

class Domain extends Stackable {
  public $auction;
  public $name;
  public $bids;
  public $traffic;
  public $valuation;
  public $price;
  public $ending;
  public $type;
  public $ranking;

  public function __construct($csv) {
    $this->auction = $csv[0];
    $this->name = $csv[1];
    $this->traffic = $csv[2];
    $this->bids = $csv[3];
    $this->price = $csv[5];
    $this->valuation = $csv[4];
    $this->ending = $csv[6];
    $this->type = $csv[7];
  }

  public function run() {
    /* we convert the time to a stamp here to keep the main thread moving */
    $this->ending = strtotime(
      $this->ending);

    $this->ranking = get_page_rank($this->name);

    /* now write the Domain to database or whatever */
  }
}

/* start some workers */
$workers = array();
while (@$worker++ < 8) {
  $workers[$worker] = new Ranking();
  $workers[$worker]->start();
}

/* open the CSV and start processing */
$handle = fopen(CSV, "r");
$domains = array();
while (($line = fgetcsv($handle))) {
  $domains[$line[0]] = new Domain($line);
  $workers[array_rand($workers)]->stack(
    $domains[$line[0]]);
}

/* cleanup handle to csv */
fclose($handle);

/* shutdown all workers (forcing all processing to finish) */
foreach ($workers as $worker)
  $worker->shutdown();

/* we now have ranked domains in memory and database */
var_dump($domains);
var_dump(count($domains));
?>

Answers to your questions:

  1. Right, 8 workers.
  2. Workers execute Stackable objects in the order they were stack()'d; this line chooses a random worker to execute the Stackable.
  3. You can traverse the list of $domains in the main process during execution, checking the status of each Stackable as it executes.
  4. All of each worker's stack will be executed before the shutdown takes place; the shutdown ensures that all work is therefore done by that point in the execution of the script.
Joe Watkins
  • I tested your code; this is going to be a great script for my problem. I have a few questions, if you could help me: 1) Basically you created 8 workers, right? 2) What does the line $workers[array_rand($workers)]->stack($domains[$line[0]]); do — execute the run function in the Domain class? 3) Where can I see the progress of each worker? 4) Why are you forcing it to shut down? – Irfan Sep 26 '13 at 11:06
  • I have integrated your code with mine. Would you please have a look and let me know if I'm heading in the right direction? Is it possible for a single worker to process multiple stacked objects, since you are choosing workers randomly? I really appreciate your help and hope to have my problem solved with it. My database query returns 55369 records, and each needs its pagerank updated. Thanks – Irfan Oct 03 '13 at 08:33
  • It looks about right ... assuming runQuery negotiates a connection for the workers where it needs one. Random is the easiest to show, and works for everyone, depending on what you are doing you might want to stack to the least busy worker ... – Joe Watkins Oct 03 '13 at 12:10