10

I'm having a problem with Gearman workers running on multiple servers which I can't seem to solve.

The problem occurs when a worker server is taken offline (as opposed to the worker process merely being cancelled): all the other worker processes then start reporting errors and stop processing jobs.

Example with just 1 client and 2 workers:

Client:

$client = new GearmanClient ();

$client->addServer ('192.168.1.200');
$client->addServer ('192.168.1.201');

$job = $client->do ('generate_tile', serialize ($arrData));

Worker:

$worker = new GearmanWorker ();

$worker->addServer ('192.168.1.200');
$worker->addServer ('192.168.1.201');

$worker->addFunction ('generate_tile', 'generate_tile');

while (1)
{
    if (!$worker->work ())
    {

        switch ($worker->returnCode ())
        {

            default:
                echo "Error: " . $worker->returnCode () . ': ' . $worker->error () . "\n";
                break;

        }

    }
}

function generate_tile ($job) { ... }

The worker code is being run on 2 separate servers. When every server is up and running, both workers execute jobs as expected. When one of the worker processes is cancelled, the other worker executes all jobs as expected.

However, when the server with the cancelled worker process is shut down and taken completely offline, requests to the client script hang and the remaining worker process does not pick up any jobs.

I get the following set of errors from the remaining worker process:

Error: 46: gearman_con_wait:timeout reached
Error: 46: gearman_con_wait:timeout reached
Error: 4: gearman_con_flush:write:110
Error: 46: gearman_con_wait:timeout reached
Error: 4: gearman_con_flush:write:113
Error: 4: gearman_con_flush:write:113
Error: 4: gearman_con_flush:write:113
....

When I start up the other server, without starting the worker process on it, the remaining worker process immediately jumps into life and executes any remaining jobs.

It seems clear to me that I need some code in the worker process to cope with any servers that may be offline; however, I cannot see how to do this.

Many thanks,

Andy

Andy Burton

5 Answers

6

Our tests with multiple Gearman servers show that if the last server in the list (192.168.1.201 in your case) is taken down, the workers stop executing in the way you describe. (Also, the workers grab jobs from the last server first. They process jobs on .200 only if there are no jobs on .201.)

It seems that this is a bug with the linked list in the Gearman server. It has been reported as fixed multiple times, but the bug persists in all available versions of Gearman. Sorry, I know that's not a solution, but we had the same problem and didn't find one. (If someone can provide a working solution for this problem, I'm willing to give a large bounty.)

Maxim Krizhanovsky
  • Interesting, thanks. I have changed the order of the servers so that the worker server I am shutting down is the first server rather than the last and, although some errors are still generated, the worker does process jobs correctly. The workaround I would suggest is to run a worker on the client server and add it as the last server. That way, if any of the other worker servers go down, everything still works, because none of them is the last server added. If the worker/client server itself goes down, the client is down anyway, so no new jobs can be submitted. – Andy Burton Aug 16 '11 at 13:14
  • Does anyone have a link to the bug report? – Jason Axelson May 05 '15 at 19:07
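
The workaround from the comment above (always add the host that also runs the client as the last server) could be sketched roughly as follows. The helper name `servers_with_last()` and the choice of 192.168.1.200 as the client host are assumptions made for illustration; only `GearmanWorker::addServer()` comes from the Gearman extension.

```php
<?php
// Hypothetical helper: return $servers reordered so that $last is the final
// entry. If $last is not in the list, it is simply appended.
function servers_with_last(array $servers, string $last): array
{
    // Drop every occurrence of $last, keeping the other servers in order.
    $others = array_values(array_filter(
        $servers,
        function ($s) use ($last) { return $s !== $last; }
    ));
    // Append it again so that it is always added last.
    $others[] = $last;
    return $others;
}

// Assumption: 192.168.1.200 is the host that also runs the client.
$ordered = servers_with_last(
    array('192.168.1.200', '192.168.1.201'),
    '192.168.1.200'
);

// Only touch Gearman when the extension is actually loaded.
if (class_exists('GearmanWorker')) {
    $worker = new GearmanWorker();
    foreach ($ordered as $host) {
        $worker->addServer($host);
    }
}
```

This only changes the order in which servers are added; it does not address the underlying behaviour described in the answer.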
4

Further to @Darhazer's answer above: we found the same thing and solved it like this:

// Gearman workers show a strong preference for servers at the end of a list so randomize the order
$worker = new GearmanWorker();
$s2 = explode(",", Configure::read('workers.servers'));
shuffle($s2);
$servers = implode(",", $s2);
$worker->addServers($servers); 

We run 6 to 10 workers at any time, and expire them after they've completed x requests.
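
A minimal sketch of how shuffling and "expire after x requests" might fit together in one worker script. The function names `shuffled_server_list()` and `run_worker()`, the `$maxJobs` parameter and the 5000 ms timeout are assumptions for illustration, not part of our actual setup; restarting the expired worker is assumed to be handled by an external supervisor.

```php
<?php
// Shuffle a comma-separated server list, as in the snippet above.
function shuffled_server_list(string $csv): string
{
    // Trim whitespace and drop empty entries before shuffling.
    $servers = array_filter(array_map('trim', explode(',', $csv)), 'strlen');
    shuffle($servers);
    return implode(',', $servers);
}

// Hypothetical worker loop: process at most $maxJobs jobs, then return so
// that a supervisor (cron, supervisord, ...) can start a fresh process.
function run_worker(string $csv, string $function, callable $handler, int $maxJobs): int
{
    $worker = new GearmanWorker();
    $worker->addServers(shuffled_server_list($csv));
    $worker->addFunction($function, $handler);
    $worker->setTimeout(5000); // assumed value, in milliseconds

    $done = 0;
    while ($done < $maxJobs) {
        if ($worker->work()) {
            $done++;
            continue;
        }
        $code = $worker->returnCode();
        // A timeout only means that no job arrived in time; keep waiting.
        if ($code === GEARMAN_TIMEOUT) {
            continue;
        }
        echo "Error: " . $code . ': ' . $worker->error() . "\n";
        sleep(1); // avoid a tight loop while a server is unreachable
    }
    return $done;
}
```

A call such as `run_worker('192.168.1.200,192.168.1.201', 'generate_tile', 'generate_tile', 100)` would then return after 100 completed jobs.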

Richard
2

I use this class, which keeps track of which jobs work on which servers. It hasn't been thoroughly tested; I only just wrote it. I've pasted an edited version, so there might be a typo or some such, but otherwise it appears to solve the issue.

<?php
class MyGearmanClient {
        static $server = "server1,server2,server3";
        static $server_array = false;
        static $workingServers = false;
        static $gmclient = false;
        static $timeout = 5000;
        static $defaultTimeout = 5000;

        static function randomServer() {
                return self::$server_array[rand(0, count(self::$server_array) -1)];
        }

        static function getServer($job = false) {
                if (self::$server_array == false) {
                        self::$server_array = explode(",", self::$server);
                        self::$workingServers = array();
                }

                $serverList = array();
                if ($job) {
                        if (array_key_exists($job, self::$workingServers)) {
                                foreach (self::$server_array as $server) {
                                        if (array_key_exists($server, self::$workingServers[$job])) {
                                                if (self::$workingServers[$job][$server]) {
                                                        $serverList[] = $server;
                                                }
                                        } else {
                                                $serverList[] = $server;
                                        }
                                }
                                if (count($serverList) == 0) {
                                        # All servers have failed, need to insert all the servers again and retry.
                                        $serverList = self::$workingServers[$job] = self::$server_array;
                                }
                                return $serverList[rand(0, count($serverList) - 1)];
                        } else {
                                return self::randomServer();
                        }
                } else {
                        return self::randomServer();
                }
        }

        static function serverWorked($server, $job) {
                self::$workingServers[$job][$server] = $server;
        }

        static function serverFailed($server, $job) {
                self::$workingServers[$job][$server] = false;
        }

        static function Connect($server = false, $job = false) {
                if ($server) {
                        self::$server = self::getServer();
                }

                self::$gmclient= new GearmanClient();
                self::$gmclient->setTimeout(self::$timeout);

                # add the default job server
                self::$gmclient->addServer($server = self::getServer($job));

                return $server;
        }

        static function Destroy() {
                self::$gmclient = false;
        }

        static function Client($name, $vars, $timeout = false) {
                if (is_int($timeout)) {
                        self::$timeout = $timeout;
                } else {
                        self::$timeout = self::$defaultTimeout;
                }


                do {
                        $server = self::Connect(false, $name);
                        $value = self::$gmclient->do($name, $vars);
                        $return_code = self::$gmclient->returnCode();
                        if (!$value) {
                                $error_message = self::$gmclient->error();
                                if ($return_code == 47) {
                                        self::serverFailed($server, $name);
                                        if (count(self::$server_array) > 1) {
                                             // ADDED SINGLE SERVER LOOP AVOIDANCE // echo "Timeout on server $server, trying another server...\n";
                                             continue;
                                        } else {
                                             return false;
                                        }
                                }
                                echo "ERR: $error_message ($return_code)\n";
                        }
                        # printf("Worker has returned\n");
                        $short_value = substr($value, 0, 80);
                        switch ($return_code)
                        {
                        case GEARMAN_WORK_DATA:
                                echo "DATA: $short_value\n";
                                break;
                        case GEARMAN_SUCCESS:
                                self::serverWorked($server, $name);
                                break;
                        case GEARMAN_WORK_STATUS:
                                list($numerator, $denominator)= self::$gmclient->doStatus();
                                echo "Status: $numerator/$denominator\n";
                                break;
                        case GEARMAN_TIMEOUT:
                                // self::Connect();
                                // Fall through
                        default:
                                echo "ERR: $error_message " . self::$gmclient->error() . " ($return_code)\n";
                                break;
                        }
                }
                while($return_code != GEARMAN_SUCCESS);

                $rv = unserialize($value);
                return $rv["rv"];
        }
}

# Example usage:
#    $rv = MyGearmanClient::Client("Function", $args);

?>
Orwellophile
  • That's a handy bit of code, thank you. I see that it randomly returns a server and then builds up an array of working servers to choose from on later requests. For the worker scripts, would you just add a single server then, rather than adding all servers to the worker as I have been doing, given that you only add a single server to the client? – Andy Burton Aug 17 '11 at 09:32
  • Mostly, I add all the workers to all the servers. But I do have one worker that needs to return fast results, for which I use dedicated "local" (one per network) server. Which reminds me, I had to correct a bug which causes an infinite retry loop if you only specify a single server - this is only a problem for me because I perform local (read: "exec()") processing if I don't get an answer in 1 second. – Orwellophile Aug 20 '11 at 11:15
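
The infinite-retry problem mentioned in the last comment can also be avoided by trying each server at most once per request. This is a simplified sketch, not the class above: `try_each_server()` and its callback are hypothetical names, and the callback stands in for whatever code actually submits the job.

```php
<?php
// Hypothetical helper: try each server at most once, in random order.
// $attempt receives a host and returns the result, or null on failure.
function try_each_server(array $servers, callable $attempt)
{
    shuffle($servers);
    foreach ($servers as $host) {
        $result = $attempt($host);
        if ($result !== null) {
            return $result;
        }
    }
    return null; // every server failed: let the caller fall back
}

// Usage with a stand-in callback. A real one would create a GearmanClient,
// call addServer($host) and setTimeout(), then submit the job.
$result = try_each_server(
    array('server1', 'server2', 'server3'),
    function ($host) {
        return $host === 'server2' ? "done on $host" : null;
    }
);
echo $result, "\n";
```

Because the loop is bounded by the number of servers, a single-server list results in one attempt and then a `null` return, at which point the caller can do its local processing.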
0

Since 'addServer' on the Gearman client does not cope properly with an offline server, this code chooses a job server at random and, if that one fails, tries the next one. This way you also balance the load.

        // job servers
        $jobservers = array('192.168.1.1','192.168.1.2');
        // prepare gearman client
        $gmclient = new GearmanClient();
        // shuffle job servers (deliver jobs equally by server)
        shuffle($jobservers);
        // add job servers
        foreach($jobservers as $jobserver) {
            // add random jobserver
            $gmclient->addServer($jobserver);
            // check server state if ok end foreach
            if (@$gmclient->ping('ping')) break;
            // if connections fails reset client
            $gmclient = new GearmanClient();
        }
DaLtOn
0

Solution tested and working ok.

     $client = new GearmanClient();
     if(!$client->addServer("11.11.65.73",4730))
        $client->addServer("11.11.65.79",4730);
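
One caveat: as far as I can tell, `addServer()` only registers the host and does not open a connection at that point, so its return value may not reveal an offline server. A sketch of an explicit reachability check before adding a server is shown below. The helper name `first_reachable()`, the one-second timeout and the injectable `$probe` callback are assumptions for illustration.

```php
<?php
// Hypothetical helper: return the first host that accepts a TCP connection,
// or null if none does. $probe is injectable for testing; by default it
// opens and closes a socket with fsockopen().
function first_reachable(
    array $hosts,
    int $port = 4730,
    float $timeout = 1.0,
    ?callable $probe = null
): ?string {
    $probe = $probe ?? function (string $host) use ($port, $timeout): bool {
        $errno = 0;
        $errstr = '';
        $sock = @fsockopen($host, $port, $errno, $errstr, $timeout);
        if ($sock === false) {
            return false;
        }
        fclose($sock);
        return true;
    };

    foreach ($hosts as $host) {
        if ($probe($host)) {
            return $host;
        }
    }
    return null;
}

// Possible usage (not executed here):
//   $host = first_reachable(array('11.11.65.73', '11.11.65.79'));
//   if ($host !== null) {
//       $client = new GearmanClient();
//       $client->addServer($host, 4730);
//   }
```

A successful TCP connection only shows that something is listening on the port; it does not prove that the job server is healthy.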