
Using two Debian servers, I need to set up a robust failover environment for cron jobs that must only run on one server at a time.

Moving a file into /etc/cron.d should do the trick, but is there a simple HA solution to manage that? And if possible, not with heartbeat ;)

Falken
  • For the record, I finally used heartbeat to do the job. There is an easier solution, however: if your machines are in the same subnet and can do multicast, I would recommend using ucarp. Much simpler than heartbeat --> http://www.ucarp.org – Falken Oct 06 '09 at 09:23
  • rcron? Gnubatch? Puppet? – symcbean Mar 08 '14 at 00:58
  • I second rcron. I'm currently using it and have almost the same setup (2 ubuntu servers behind a loadbalancer). – Ali May 02 '16 at 15:54

7 Answers


I think heartbeat/pacemaker would be the best solution, since they take care of a lot of the race conditions, fencing, etc. for you in order to ensure the job only runs on one host at a time. It's possible to design something yourself, but it likely won't account for all the scenarios those packages do, and you'll eventually end up reinventing most of the wheel, if not all of it.

If you don't really care about such things and want a simpler setup, I suggest staggering the cron jobs on the servers by a few minutes. Then when the job starts on the primary, it can somehow leave a marker on whatever shared resource the jobs operate on (you don't specify this, so I'm being intentionally vague). If it's a database, the job can update a field in a table; if it's a shared filesystem, it can lock a file.

When the job runs on the second server, it can check for the presence of the marker and abort if it is there.
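For example, here is a minimal sketch of such a wrapper for the shared-filesystem case; the /shared/locks directory, the job path, and the daily schedule are assumptions, not something from the question:

#!/bin/bash
# Sketch only: leave a dated marker on the shared filesystem so that the
# staggered copy of this job on the other server sees it and aborts.
MARKER="/shared/locks/nightly-job.$(date +%F)"

# Symlink creation is atomic, so exactly one server wins the race.
if ln -s "$(hostname)" "$MARKER" 2>/dev/null; then
    /usr/local/bin/nightly-job
else
    exit 0   # the other server already claimed today's run
fi

Old markers need occasional cleanup, and this deliberately ignores the "primary died mid-job" case, which is exactly the sort of thing heartbeat/pacemaker handle for you.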

Kamil Kisiel

Actually there is no satisfactory solution in this area. We have tried them all: scripting solutions, cron with heartbeat/pacemaker, and more. Until recently, the only real option was a grid solution, which is naturally not what we want, seeing as a grid is more than overkill for this scenario.

That's why I started the CronBalancer project. It works exactly like a normal cron server, except it's distributed, load-balanced, and HA (when finished). Currently the first two points are done (in beta), and it works with a standard crontab file.

The HA framework is in place. All that's left is the signaling needed to determine the failover and recovery actions.

http://sourceforge.net/projects/cronbalancer/

chuck


I had been using a Nagios event handler as a simple solution.

On the NRPE server:

command[check_crond]=/usr/lib64/nagios/plugins/check_procs -c 1: -C crond
command[autostart_crond]=sudo /etc/init.d/crond start
command[stop_crond]=sudo /etc/init.d/crond stop

Don't forget to add the nagios user to the sudoers file:

nagios  ALL=(ALL)   NOPASSWD:/usr/lib64/nagios/plugins/, /etc/init.d/crond

and disable requiretty:

Defaults:nagios !requiretty

On the Nagios server:

services.cfg

define service{
    use                     generic-service
    host_name               cpc_3.145
    service_description     crond
    check_command           check_nrpe!check_crond
    event_handler           autostart_crond!cpc_2.93
    process_perf_data       0
    contact_groups          admin,admin-sms
}

commands.cfg

define command{
    command_name    autostart_crond
    command_line    $USER1$/eventhandlers/autostart_crond.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $ARG1$
}

autostart_crond.sh

#!/bin/bash

# Called with $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$ $ARG1$:
# $1 is the state of crond on the monitored host, $4 is the peer host
# on which to start/stop crond via NRPE.
case "$1" in
    OK)
        # crond is healthy on the primary, so stop it on the peer
        /usr/local/nagios/libexec/check_nrpe -H "$4" -c stop_crond
        ;;
    WARNING)
        ;;
    UNKNOWN)
        /usr/local/nagios/libexec/check_nrpe -H "$4" -c autostart_crond
        ;;
    CRITICAL)
        # crond is down on the primary, so start it on the peer
        /usr/local/nagios/libexec/check_nrpe -H "$4" -c autostart_crond
        ;;
esac

exit 0

but I have since switched to Pacemaker and Corosync, since that is the best way to ensure that the resource only runs on one node at a time.

Here are the steps I followed:

Verify that the crond init script is LSB compliant. On my CentOS box, I had to change the exit status from 1 to 0 (when starting an already-running service or stopping a stopped one) to match the requirements:

start() {
    echo -n $"Starting $prog: " 
    if [ -e /var/lock/subsys/crond ]; then
        if [ -e /var/run/crond.pid ] && [ -e /proc/`cat /var/run/crond.pid` ]; then
            echo -n $"cannot start crond: crond is already running.";
            failure $"cannot start crond: crond already running.";
            echo
            #return 1
            return 0
        fi
    fi

stop() {
    echo -n $"Stopping $prog: "
    if [ ! -e /var/lock/subsys/crond ]; then
        echo -n $"cannot stop crond: crond is not running."
        failure $"cannot stop crond: crond is not running."
        echo
        #return 1;
        return 0;
    fi

Then it can be added to Pacemaker with:

# crm configure primitive Crond lsb:crond \
        op monitor interval="60s"

crm configure show

node SVR022-293.localdomain
node SVR233NTC-3145.localdomain
primitive Crond lsb:crond \
        op monitor interval="60s"
property $id="cib-bootstrap-options" \
        dc-version="1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f" \
        cluster-infrastructure="openais" \
        expected-quorum-votes="2" \
        stonith-enabled="false" \
        no-quorum-policy="ignore"
rsc_defaults $id="rsc-options" \
        resource-stickiness="100"

crm status

============
Last updated: Fri Jun  7 13:44:03 2013
Stack: openais
Current DC: SVR233NTC-3145.localdomain - partition with quorum
Version: 1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, 2 expected votes
1 Resources configured.
============

Online: [ SVR022-293.localdomain SVR233NTC-3145.localdomain ]

 Crond  (lsb:crond):    Started SVR233NTC-3145.localdomain

Testing failover by stopping Pacemaker and Corosync on 3.145:

[root@3145 corosync]# service pacemaker stop
Signaling Pacemaker Cluster Manager to terminate:          [  OK  ]
Waiting for cluster services to unload:......              [  OK  ]

[root@3145 corosync]# service corosync stop
Signaling Corosync Cluster Engine (corosync) to terminate: [  OK  ]
Waiting for corosync services to unload:.                  [  OK  ]

Then check the cluster status on 2.93:

============
Last updated: Fri Jun  7 13:47:31 2013
Stack: openais
Current DC: SVR022-293.localdomain - partition WITHOUT quorum
Version: 1.1.5-1.1.el5-01e86afaaa6d4a8c4836f68df80ababd6ca3902f
2 Nodes configured, 2 expected votes
1 Resources configured.
============

Online: [ SVR022-293.localdomain ]
OFFLINE: [ SVR233NTC-3145.localdomain ]

Crond   (lsb:crond):    Started SVR022-293.localdomain
quanta

We use two approaches, depending on the requirements. Both involve having the crons present and running on all machines, but with a bit of sanity checking involved:

  1. If the machines are in a primary/secondary relationship (there may be more than one secondary), then the scripts are modified to check whether the machine they are running on is in the primary state. If not, they simply exit quietly. I don't have an HB setup to hand at the moment, but I believe you can query HB for this information.

  2. If all machines are eligible primaries (such as in a cluster), then some locking is used, by way of either a shared database or a PID file. Only one machine ever obtains the lock; those that don't exit quietly. Both checks are sketched below.
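A rough sketch of both checks, for illustration only; the job name, the locks.cron_lock table, and testing cl_status against "all" are assumptions about your setup:

#!/bin/bash
# Sanity-check wrapper; all names here are hypothetical.

# Approach 1: ask heartbeat whether this node currently holds resources.
# cl_status ships with heartbeat; rscstatus prints all/local/foreign/none.
if [ "$(cl_status rscstatus)" != "all" ]; then
    exit 0    # not the primary, exit quietly
fi

# Approach 2: atomic test-and-set in a shared database. cron_lock has a
# primary key on (job, run_date), so the INSERT succeeds on exactly one
# machine and the duplicate-key failure tells the others to exit quietly.
if ! mysql -e "INSERT INTO locks.cron_lock VALUES ('nightly_job', CURDATE())" 2>/dev/null; then
    exit 0
fi

/usr/local/bin/nightly-job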

Dan Carley

To make a long story short, you have to turn your cron scripts into cluster-able applications of some kind. Be the implementation as lightweight or as heavyweight as you need, they still need one thing: the ability to properly resume/restart (or recover their state) after a primary-node failover. The trivial case is that they are stateless programs (or "stateless enough" programs) that can simply be restarted at any time and will do just fine. This is probably not your case. Note that for stateless programs you don't need failover at all, because you could simply run them in parallel on all the nodes.

In the normally complicated case, your scripts should live on the cluster's shared storage, should store their state in files there, should change the state stored on disk only atomically, and should be able to continue their action from any transient state they detect on startup.
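For illustration, a minimal sketch of that pattern; the step commands and the /shared paths are placeholders:

#!/bin/bash
# Record progress on shared storage with write-then-rename: rename() is
# atomic on the same filesystem, so after a failover the surviving node
# sees either the old state or the new one, never a torn write.
STATE=/shared/myjob/state

save_state() {
    echo "$1" > "$STATE.tmp.$$" && mv "$STATE.tmp.$$" "$STATE"
}

# Resume from whatever step the failed node last completed.
step=$(cat "$STATE" 2>/dev/null || echo 0)
[ "$step" -lt 1 ] && { /usr/local/bin/job-step1; save_state 1; }
[ "$step" -lt 2 ] && { /usr/local/bin/job-step2; save_state 2; }
[ "$step" -lt 3 ] && { /usr/local/bin/job-step3; save_state 3; }
exit 0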

kubanczyk

I prefer rcron for this particular problem. You have a state file which simply says "active" or "passive", and if it says "active", the cron job will run on that machine; if it says "passive", it won't. Simple as that.

Now you can use Red Hat Cluster Suite or any other clustering middleware to manage the state files across your cluster, or you can manually set "active" on a certain node, and that's it.
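For a feel of what rcron does under the hood, here is a minimal equivalent; the state file path is an example (rcron itself reads the path from its config file):

#!/bin/bash
# if-active: run the given command only when this node's state file says
# "active". Install the same crontab on both machines and wrap each job.
STATE=/var/lib/rcron/state

grep -qx "active" "$STATE" 2>/dev/null || exit 0
exec "$@"

A cron.d entry then looks the same on both machines, e.g. 0 4 * * * root /usr/local/bin/if-active /usr/local/bin/nightly-job.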

Jakov Sosic

Making it execute or not execute on a particular machine is trivial: either have a script put a cron job into /etc/cron.d, as you suggest, or keep the script permanently in /etc/cron.d but have the script itself do the failover checking and decide whether to execute. A sketch of the first variant follows.
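Assuming a hypothetical is-primary check (a VIP test, a heartbeat query, whatever your setup provides), the first variant could look like this, run from your failover hook:

#!/bin/bash
# Hypothetical failover hook: install the job in /etc/cron.d only on the
# node that is currently primary. All file names are examples.
if /usr/local/bin/is-primary; then
    cp /etc/cron.jobs/nightly /etc/cron.d/nightly
else
    rm -f /etc/cron.d/nightly
fi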

The common (missing) part in both of these is how the script checks whether the script on the other machine is running.

Without more information about what you're trying to do, this is hard to answer.

Schof