I'm having a problem with Gearman workers running on multiple servers which I can't seem to solve.
The problem occurs when a worker server is taken completely offline (as opposed to just cancelling the worker process on it): every other worker process then starts to error and fail.
Example with just 1 client and 2 workers -
Client:
$client = new GearmanClient ();
$client->addServer ('192.168.1.200');
$client->addServer ('192.168.1.201');
$job = $client->do ('generate_tile', serialize ($arrData));
Worker:
$worker = new GearmanWorker ();
$worker->addServer ('192.168.1.200');
$worker->addServer ('192.168.1.201');
$worker->addFunction ('generate_tile', 'generate_tile');
while (1)
{
    if (!$worker->work ())
    {
        switch ($worker->returnCode ())
        {
            default:
                echo "Error: " . $worker->returnCode () . ': ' . $worker->error () . "\n";
                break;
        }
    }
}
function generate_tile ($job) { ... }
The worker code is being run on two separate servers. When both servers are up and running, both workers execute jobs as expected. When one of the worker processes is cancelled (but its server stays online), the other worker executes all jobs as expected.
However, when the server with the cancelled worker process is shut down and taken completely offline, requests to the client script hang and the remaining worker process does not pick up any jobs.
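For what it's worth, I assume I could at least stop the client call from hanging forever with a client-side timeout along these lines (using setTimeout (), which I understand takes milliseconds), but that only masks the problem rather than fixing the worker side:
$client = new GearmanClient ();
$client->addServer ('192.168.1.200');
$client->addServer ('192.168.1.201');

// Assumption: give up waiting after 5 seconds instead of blocking indefinitely
$client->setTimeout (5000);

$job = $client->do ('generate_tile', serialize ($arrData));
if ($client->returnCode () == GEARMAN_TIMEOUT)
{
    echo "Client timed out waiting for a worker\n";
}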
I get the following set of errors from the remaining worker process:
Error: 46: gearman_con_wait:timeout reached
Error: 46: gearman_con_wait:timeout reached
Error: 4: gearman_con_flush:write:110
Error: 46: gearman_con_wait:timeout reached
Error: 4: gearman_con_flush:write:113
Error: 4: gearman_con_flush:write:113
Error: 4: gearman_con_flush:write:113
....
When I start the other server back up, without starting the worker process on it, the remaining worker process immediately springs into life and executes any remaining jobs.
It seems clear to me that I need some code in the worker process to cope with any servers that may be offline, but I cannot see how to do this.
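To illustrate the kind of thing I mean, here is a rough sketch of a non-blocking worker loop (assuming setTimeout () and GEARMAN_WORKER_NON_BLOCKING are the right tools for this), but I don't know whether this is the correct approach:
$worker = new GearmanWorker ();
$worker->addOptions (GEARMAN_WORKER_NON_BLOCKING);
$worker->setTimeout (5000); // milliseconds
$worker->addServer ('192.168.1.200');
$worker->addServer ('192.168.1.201');
$worker->addFunction ('generate_tile', 'generate_tile');

while (1)
{
    @$worker->work ();

    switch ($worker->returnCode ())
    {
        case GEARMAN_SUCCESS:
            break;

        case GEARMAN_IO_WAIT:
        case GEARMAN_NO_JOBS:
            // No job available yet; block until activity or timeout
            @$worker->wait ();
            break;

        case GEARMAN_TIMEOUT:
            // setTimeout () elapsed with nothing to do; loop and retry
            break;

        default:
            echo "Error: " . $worker->returnCode () . ': ' . $worker->error () . "\n";
            sleep (1); // back off before retrying after an unexpected error
            break;
    }
}
Even with this, I'm not sure whether wait () copes with a connection to a server that has gone away entirely, which seems to be the heart of the problem.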
Many thanks,
Andy