I have a simple 3-node Pacemaker/Corosync cluster on Ubuntu 14.04.2, with two resources configured, both IP addresses. Installed packages:
ii crmsh 1.2.5+hg1034-1ubuntu4 all CRM shell for the pacemaker cluster manager
ii pacemaker 1.1.10+git20130802-1ubuntu2.3 amd64 HA cluster resource manager
ii pacemaker-cli-utils 1.1.10+git20130802-1ubuntu2.3 amd64 Command line interface utilities for Pacemaker
ii corosync 2.3.3-1ubuntu1 amd64 Standards-based cluster framework (daemon and modules)
ii libcorosync-common4 2.3.3-1ubuntu1 amd64 Standards-based cluster framework, common library
It works flawlessly, except that after running for about a week without any failover or reboot, the cluster stops reacting to nodes dying. I have been able to reproduce this a few times.
When I reboot a node, the crm status command on the other nodes still shows it as "UP" (I'd expect to see it as DOWN in between).
If I then restart another node, preferably the DC, I end up with "no quorum" on the last node, since 2 out of 3 nodes are temporarily down.
Finally, when the first two nodes boot up again, the cluster is healthy again.
If I now restart any of the 3 nodes, crm status instantly shows "DOWN" for that node. This keeps working for the next few days, until the cluster becomes "stale" again.
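As far as I understand it, the "no quorum" on the last surviving node is expected and is not the bug itself. Here is a minimal sketch of the majority arithmetic, assuming the usual votequorum rule of quorum = expected_votes / 2 + 1 (integer division) with one vote per node. The helper names quorum_needed and has_quorum are made up for illustration:

```shell
# Sketch of the majority rule (assumption: one vote per node,
# quorum = expected_votes / 2 + 1 using integer division).
quorum_needed() {
    # $1 = expected votes
    echo $(( $1 / 2 + 1 ))
}

has_quorum() {
    # $1 = expected votes, $2 = votes currently alive
    [ "$2" -ge "$(quorum_needed "$1")" ]
}

# Walk through 3, 2 and 1 live nodes out of 3 expected.
for live in 3 2 1; do
    if has_quorum 3 "$live"; then
        state="quorate"
    else
        state="no quorum"
    fi
    echo "expected=3 live=$live -> $state"
done
```

So a single node out of three losing quorum matches the rule. The part I cannot explain is the first reboot not being noticed at all.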
Can someone hint at what might cause this? A freshly restarted cluster works perfectly for some days. Then the DC becomes... "stale"?
Grepping the syslogs for 'corosync\|pacemakerd\|crmd\|attrd' didn't show me the problem (or I missed it).
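For reference, the log search was roughly the following. This is a sketch rather than the exact command: the /var/log/syslog path is an assumption (the Ubuntu default), LOG and filter_cluster_log are names made up here, and the pattern is the extended-regex form of the one above:

```shell
# Sketch only: LOG defaults to /var/log/syslog (assumed Ubuntu default path).
LOG="${LOG:-/var/log/syslog}"

# Keep only lines mentioning the cluster daemons.
# grep -E takes plain | for alternation, equivalent to \| in GNU basic regex.
filter_cluster_log() {
    grep -E 'corosync|pacemakerd|crmd|attrd' "$1"
}

# Show the most recent matches if the log file is readable.
if [ -r "$LOG" ]; then
    filter_cluster_log "$LOG" | tail -n 50
fi
```

Rotated logs (syslog.1 and the compressed older ones) would need the same treatment, since the stale state only appears after several days.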
Should I schedule a daily restart of corosync/pacemaker to prevent this weird state?
Here is my basic corosync.conf file:
totem {
    version: 2
    token: 3000
    token_retransmits_before_loss_const: 10
    join: 60
    consensus: 3600
    vsftype: none
    max_messages: 20
    clear_node_high_bit: yes
    secauth: off
    threads: 0
    rrp_mode: none
    interface {
        ringnumber: 0
        bindnetaddr: 10.20.0.0
        mcastaddr: 226.94.1.1
        mcastport: 5405
    }
}
amf {
    mode: disabled
}
quorum {
    # Quorum for the Pacemaker Cluster Resource Manager
    provider: corosync_votequorum
    expected_votes: 2
}
aisexec {
    user: root
    group: root
}
logging {
    fileline: off
    to_stderr: yes
    to_logfile: no
    to_syslog: yes
    syslog_facility: daemon
    debug: off
    timestamp: on
    logger_subsys {
        subsys: AMF
        debug: off
        tags: enter|leave|trace1|trace2|trace3|trace4|trace6
    }
}