corosync/pacemaker "stale" state after a week of running

Question

I have a simple 3-node pacemaker/corosync setup on Ubuntu 14.04.2, with 2 resources (both IP addresses) configured.

ii  crmsh                               1.2.5+hg1034-1ubuntu4            all          CRM shell for the pacemaker cluster manager
ii  pacemaker                           1.1.10+git20130802-1ubuntu2.3    amd64        HA cluster resource manager
ii  pacemaker-cli-utils                 1.1.10+git20130802-1ubuntu2.3    amd64        Command line interface utilities for Pacemaker
ii  corosync                            2.3.3-1ubuntu1                   amd64        Standards-based cluster framework (daemon and modules)
ii  libcorosync-common4                 2.3.3-1ubuntu1                   amd64        Standards-based cluster framework, common library

It works flawlessly, except that when left for about a week without any failover or reboot, the cluster stops reacting to nodes dying. I was able to reproduce the situation a few times.

When I reboot a node, the crm status command on the other nodes keeps showing it as "UP" (I'd expect to see it as DOWN in between).
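
One way to check this mismatch on a surviving node is to compare what corosync and pacemaker each report. A minimal sketch, assuming the stock corosync 2.x and pacemaker command-line tools are installed; the function name show_cluster_views is made up for illustration:

```shell
# Print corosync's and pacemaker's view of the cluster, one after the other.
# show_cluster_views is a hypothetical helper name, not part of either package.
show_cluster_views() {
    echo "== corosync quorum status =="
    corosync-quorumtool -s

    echo "== corosync membership (cmap runtime keys) =="
    corosync-cmapctl | grep members

    echo "== pacemaker status (one-shot) =="
    crm_mon -1
}
```

Running show_cluster_views right after rebooting another node should show whether corosync itself still lists the rebooted node as a member, or whether only pacemaker's status is out of date.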

If I then restart another node, preferably the DC, I end up with "no quorum" on the last node, since 2 out of 3 are temporarily down.

Finally, once the first two have booted up again, the cluster is healthy again.

If I now restart any of the 3 nodes, I can instantly see crm status being updated with "DOWN" for the given node. This keeps working for the next few days, until the cluster becomes "stale" again.

Can someone hint at what the cause might be? A freshly restarted cluster works perfectly for some days. Then the DC becomes... "stale"?

Grepping for 'corosync\|pacemakerd\|crmd\|attrd' in the syslogs didn't show me the problem (or I missed it).
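
The search described above can be sketched as follows. This assumes the default Ubuntu syslog location /var/log/syslog; the function name cluster_log is only illustrative, and grep -E is used so the alternation needs no backslashes:

```shell
# Filter messages from the cluster daemons out of a syslog file.
# cluster_log is a hypothetical helper; the default path is an assumption
# (the stock Ubuntu location) and can be overridden with the first argument.
cluster_log() {
    grep -E 'corosync|pacemakerd|crmd|attrd' "${1:-/var/log/syslog}"
}
```

Calling it with a rotated file, for example cluster_log /var/log/syslog.1, covers the days before the cluster went stale.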

Should I schedule a daily restart of corosync/pacemaker to prevent this weird state?

Here is my basic corosync.conf file:

totem {
        version: 2
        token: 3000
        token_retransmits_before_loss_const: 10
        join: 60
        consensus: 3600
        vsftype: none
        max_messages: 20
        clear_node_high_bit: yes
        secauth: off
        threads: 0
        rrp_mode: none
        interface {
                ringnumber: 0
                bindnetaddr: 10.20.0.0
                mcastaddr: 226.94.1.1
                mcastport: 5405
        }
}

amf {
        mode: disabled
}

quorum {
        # Quorum for the Pacemaker Cluster Resource Manager
        provider: corosync_votequorum
        expected_votes: 2
}

aisexec {
        user:   root
        group:  root
}

logging {
        fileline: off
        to_stderr: yes
        to_logfile: no
        to_syslog: yes
        syslog_facility: daemon
        debug: off
        timestamp: on
        logger_subsys {
                subsys: AMF
                debug: off
                tags: enter|leave|trace1|trace2|trace3|trace4|trace6
        }
}
