How should I multithread some php-cli code that needs a timeout?

I'm using PHP 5.6 on CentOS 6.6 from the command line.

I'm not very familiar with multithreading terminology or code. I'll simplify the code here but it is 100% representative of what I want to do.

The non-threaded code currently looks something like this:

$datasets = MyLibrary::getAllRawDataFromDBasArrays();
foreach ($datasets as $dataset) {
    MyLibrary::processRawDataAndStoreResultInDB($dataset);
}
exit; // just for clarity

I need to prefetch all my datasets, and each processRawDataAndStoreResultInDB() call cannot fetch its own dataset. Sometimes processRawDataAndStoreResultInDB() takes too long to process a dataset, so I want to limit the amount of time it has to process it.

So you can see that making it multithreaded would

  1. Speed it up by allowing multiple processRawDataAndStoreResultInDB() to execute at the same time
  2. Use set_time_limit() to limit the amount of time each one has to process each dataset

Notice that I don't need to come back to my main program. Since this is a simplification, you can trust that I don't want to collect all the processed datasets and do a single save into the DB after they are all done.

I'd like to do something like:

class MyWorkerThread extends SomeThreadType {
  public function __construct($timeout, $dataset) {
    $this->timeout = $timeout;
    $this->dataset = $dataset;
  }

  public function run() {
    set_time_limit($this->timeout);
    MyLibrary::processRawDataAndStoreResultInDB($this->dataset);
  } 
}

$numberOfThreads = 4;
$pool = new SomePoolClass($numberOfThreads); // hypothetical pool class
$pool->start();

$datasets = MyLibrary::getAllRawDataFromDBasArrays();
$timeoutForEachThread = 5; // seconds
foreach ($datasets as $dataset) {
  $thread = new MyWorkerThread($timeoutForEachThread, $dataset);

  // Hypothetical API: the closure captures what it needs explicitly
  $thread->addCallbackOnTerminated(function () use ($thread, $dataset) {
    if ($thread->isTimeout()) {
      MyLibrary::saveBadDatasetToDb($dataset);
    }
  });

  $pool->addToQueue($thread);
}

$pool->waitUntilAllWorkersAreFinished();
exit; // for clarity

From my research online I've found the PHP extension pthreads, which I can use with my thread-safe PHP CLI. Alternatively, I could use the PCNTL extension or a wrapper library around it (say, Arara/Process).

When I look at them and their examples though (especially the pthreads pool example) I get confused quickly by the terminology and which classes I should use to achieve the kind of multithreading I'm looking for.

I wouldn't even mind creating the pool class myself if I had isRunning(), isTerminated(), getTerminationStatus() and execute() functions on a thread class, as it would be a simple queue.

Can someone with more experience please direct me to which library, classes and functions I should be using to map to my example above? Am I taking the wrong approach completely?

Thanks in advance.

Finlay Beaton
  • I don't think you need multi-threading specifically. I'd just use a queue like Gearman or Resque, and from your worker code you can invoke `php` with a custom timeout ini setting. If you want to take advantage of multiple processors/cores, just have a few workers running, so they can grab jobs in parallel. – halfer Feb 13 '15 at 21:58
  • Are worker processes an option for you? If yes, I would simply fork off some workers... – hek2mgl Feb 13 '15 at 22:21
  • @halfer: Adding additional programs or services (like Gearman and Resque) to my install is a bit more complexity than I'd like; they come with their own configurations and headaches. I'd rather just use PHP's ability to create my threads, even if that means headaches in figuring out how to do it (as this question shows). – Finlay Beaton Feb 13 '15 at 22:34
  • @hek2mgl: indeed I believe worker processes are an option for me and what I had in mind! My question is how? PHP provides many different ways of doing this in both the pcntl and pthreads extensions, I am having trouble pinpointing what I should be using to achieve the desired result above. – Finlay Beaton Feb 13 '15 at 22:36
  • I'd say installing a queue is less hassle than having to recompile PHP `;-)` (afaik on Centos you'll have to do that for `pthreads`, it's only on some Windows binaries that this is already done for you). – halfer Feb 13 '15 at 22:37
  • @FinlayBeaton I'm currently preparing an example for you ;) – hek2mgl Feb 13 '15 at 22:37
  • @halfer: I'm not! We are using RPMs from webtatic for CentOS, and they provide a thread-safe PHP CLI in the executable zts-php. pcntl is also enabled even in the non-thread-safe php-cli, I believe. I'm not sure I could get another program installed on these servers through our ops group. I agree that Gearman and Resque are cool, but sadly not within the scope of this question. – Finlay Beaton Feb 13 '15 at 22:39
  • OK, if pthreads is installed, fair enough. I take it that this appears as an extension with `php -m` then? – halfer Feb 13 '15 at 22:43
  • @halfer: that is correct, pthreads and pcntl are both available and listed in `zts-php -m` which is my php cli I'm using from webtatic. – Finlay Beaton Feb 13 '15 at 22:46
  • for example, some of my confusion comes from the plethora of options. pthreads offers classes Threaded, Thread, Worker, Collectable, Pool. pcntl offers functions fork, exec. pcntl wrappers like Arara offer classes Callback, Command, Daemon, Spawning, Control. Reading the documentation I couldn't figure out what I should be using. I have a feeling pthreads > pcntl since I have pthreads available, but beyond that the implementation details look completely foreign to me, unfortunately. – Finlay Beaton Feb 13 '15 at 22:57

1 Answer

Here comes an example using worker processes. I'm using the pcntl extension.

/**
 * Spawns a worker process and returns its pid, or -1
 * if something goes wrong.
 *
 * @param callable $callback function, closure or method to call;
 *                           its integer return value becomes the exit status
 * @return integer
 */
function worker($callback) {
    $pid = pcntl_fork();
    if($pid === 0) {
        // Child process
        exit($callback());
    } else {
        // Main process or an error
        return $pid;
    }
}


$datasets = array(
    array('test', '123'),
    array('foo', 'bar')
);

$maxWorkers = 1;
$numWorkers = 0;
foreach($datasets as $dataset) {
    $pid = worker(function () use ($dataset) {
        // Do DB stuff here
        var_dump($dataset);
        return 0;
    });

    if($pid !== -1) {
        $numWorkers++;
    } else {
        // Handle fork errors here
        echo 'Failed to spawn worker';
    }

    // If $maxWorkers is reached we need to wait
    // for at least one child to return
    if($numWorkers === $maxWorkers) {
        // $status is passed by reference
        $pid = pcntl_wait($status);
        echo "child process $pid returned $status\n";
        $numWorkers--;
    }
}

// (Non-blocking) wait for the remaining children
while(true) {
    // $status is passed by reference
    $pid = pcntl_wait($status, WNOHANG);

    if(is_null($pid) || $pid === -1) {
        break;
    }

    if($pid === 0) {
        // Be patient ...
        usleep(50000);
        continue;
    }

    echo "child process $pid returned $status\n";
}
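
The example above does not enforce a per-dataset timeout. As the follow-up comments note, set_time_limit() is not reliable when the time is spent outside PHP itself (sleeping, system calls, waiting on a database), and pcntl_alarm() inside the child is one alternative. The following is a minimal sketch of that idea, not a tested drop-in: runWithTimeout() and describeStatus() are hypothetical helper names, and it assumes no handler is installed for SIGALRM, so the signal's default action terminates the child.

```php
<?php
// Hypothetical helper: fork a child that is killed by SIGALRM
// if the callback runs longer than $timeoutSeconds.
function runWithTimeout(callable $callback, $timeoutSeconds) {
    $pid = pcntl_fork();
    if ($pid === 0) {
        // Child: schedule SIGALRM; with no handler installed,
        // the default action terminates this process.
        pcntl_alarm($timeoutSeconds);
        exit((int) $callback());
    }
    // Parent: the child's pid, or -1 if the fork failed
    return $pid;
}

// Hypothetical helper: turn the raw wait status into a readable label.
function describeStatus($status) {
    if (pcntl_wifsignaled($status) && pcntl_wtermsig($status) === SIGALRM) {
        return 'timeout';
    }
    if (pcntl_wifexited($status)) {
        return 'exit ' . pcntl_wexitstatus($status);
    }
    return 'unknown';
}

// Usage: a job that sleeps for 10 seconds but is only allowed 1 second.
$pid = runWithTimeout(function () {
    sleep(10);
    return 0;
}, 1);
pcntl_waitpid($pid, $status);
echo describeStatus($status), "\n";
```

The parent never has to time anything itself; it only inspects the status that pcntl_wait() or pcntl_waitpid() hands back, which is where a "save bad dataset" step could go.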
hek2mgl
  • Thanks for this solution, I'm currently using it. I was hoping for a solution using pthreads, but in the end I just wanted something that worked. It should be noted that the if ($numWorkers... in the foreach() needs to decrement $numWorkers. For others: be careful if you have resources (files or database connections) open that you want to re-use, as they won't work in the child; you have to re-open them. – Finlay Beaton Feb 16 '15 at 15:31
  • About $numWorkers, yes, you are right. Edited that. Yes, you probably would need to reopen some resources like database connections (have you tried that?) At least file handles should work in the child. Also sockets.. – hek2mgl Feb 16 '15 at 15:32
  • For others trying to do the same thing I was, it should be noted that it does not appear you can use set_time_limit if your timeout occurs outside of PHP as it was for me, and you will have to re-jig things using a while() with pcntl_wait WNOHANG in a single loop to create the queue, and take out the foreach. The above solution works, it just doesn't enforce a timeout. – Finlay Beaton Feb 16 '15 at 19:30
  • Upon further research, you can use pcntl_alarm() inside the worker processes instead of set_time_limit to achieve the desired result, and use the pcntl functions to figure out what happened in the pooler. – Finlay Beaton Feb 16 '15 at 20:09
  • Also, I forgot to mention that the callback is expected to return an integer value (`0` for success). Btw, why not put my starting point and your additions together in a small GitHub library? – hek2mgl Feb 16 '15 at 21:16
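
The comments above describe replacing the foreach with a single while() loop around pcntl_wait() and WNOHANG, combined with pcntl_alarm() in each child. A rough sketch of how that might look follows. runPool() and its parameters are hypothetical names, it assumes $maxWorkers is at least 1, and it assumes each job is a callable returning 0 on success.

```php
<?php
// Hypothetical single-loop pool: keeps up to $maxWorkers children running
// and records how each job ended ('ok', 'timeout' or 'failed').
function runPool(array $jobs, $maxWorkers, $timeoutSeconds) {
    $pending = array_keys($jobs); // job keys not yet started
    $running = array();           // pid => job key
    $results = array();           // job key => outcome

    while ($pending || $running) {
        // Start new children while there is free capacity.
        while ($pending && count($running) < $maxWorkers) {
            $key = array_shift($pending);
            $pid = pcntl_fork();
            if ($pid === -1) {
                $results[$key] = 'failed'; // fork error
                continue;
            }
            if ($pid === 0) {
                // Child: SIGALRM's default action ends the process on timeout.
                pcntl_alarm($timeoutSeconds);
                exit((int) call_user_func($jobs[$key]));
            }
            $running[$pid] = $key;
        }
        if (!$running) {
            continue; // nothing to reap; the outer condition decides
        }

        // Non-blocking check for a finished child.
        $pid = pcntl_wait($status, WNOHANG);
        if ($pid === 0) {
            usleep(50000); // nobody finished yet
        } elseif ($pid === -1) {
            break; // unexpected wait error; avoid spinning forever
        } elseif (isset($running[$pid])) {
            $key = $running[$pid];
            unset($running[$pid]);
            if (pcntl_wifsignaled($status) && pcntl_wtermsig($status) === SIGALRM) {
                $results[$key] = 'timeout';
            } elseif (pcntl_wifexited($status) && pcntl_wexitstatus($status) === 0) {
                $results[$key] = 'ok';
            } else {
                $results[$key] = 'failed';
            }
        }
    }
    return $results;
}

// Usage: one quick job and one that exceeds the 1-second limit.
$results = runPool(array(
    'fast' => function () { return 0; },
    'slow' => function () { sleep(5); return 0; },
), 2, 1);
print_r($results);
```

As noted earlier in the thread, resources such as database connections opened before the fork should be re-opened inside each job rather than shared with the parent.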