I am running Corosync and Pacemaker on a cman stack, in an active-active setup serving web pages, and I've hit a brick wall. I'm using the IPaddr2 resource agent to provide an IP address that is used simultaneously by both nodes, and after about 15 minutes it stops working. According to pcs the resource is still running, and the ClusterIP rule is still in iptables, but the IP becomes unreachable. If I restart iptables, the cluster IP works for another 15 minutes, then stops again. Here is my CIB config:
Resources:
 Master: RepoDataClone
  Meta Attrs: master-max=2 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
  Resource: RepoData (class=ocf provider=linbit type=drbd)
   Attributes: drbd_resource=fstore
   Operations: start interval=0s timeout=240 (RepoData-start-timeout-240)
               promote interval=0s timeout=90 (RepoData-promote-timeout-90)
               demote interval=0s timeout=90 (RepoData-demote-timeout-90)
               stop interval=0s timeout=100 (RepoData-stop-timeout-100)
               monitor interval=30s (RepoData-monitor-interval-30s)
 Resource: UpdateService (class=lsb type=rp-doReplicate)
  Operations: monitor interval=30s (UpdateService-monitor-interval-30s)
 Clone: ClusterIP-clone
  Meta Attrs: clone-max=2 clone-node-max=2 globally-unique=true interleave=false
  Resource: ClusterIP (class=ocf provider=heartbeat type=IPaddr2)
   Attributes: ip=10.140.14.1 cidr_netmask=24 clusterip_hash=sourceip nic=eth0 broadcast=10.140.14.255 arp_interval=500 arp_bg=yes
   Meta Attrs: resource-stickiness=0
   Operations: start interval=0s timeout=20s (ClusterIP-start-timeout-20s)
               stop interval=0s timeout=20s (ClusterIP-stop-timeout-20s)
               monitor interval=5s (ClusterIP-monitor-interval-5s)
 Clone: Repo-clone
  Meta Attrs: interleave=false
  Resource: Repo (class=ocf provider=heartbeat type=apache)
   Attributes: configfile=/etc/httpd/conf/httpd.conf statusurl=http://localhost:443/server-status
   Operations: start interval=0s timeout=40s (Repo-start-timeout-40s)
               stop interval=0s timeout=60s (Repo-stop-timeout-60s)
               monitor interval=30s (Repo-monitor-interval-30s)
 Clone: RepoFS-clone
  Meta Attrs: interleave=true
  Resource: RepoFS (class=ocf provider=heartbeat type=Filesystem)
   Attributes: device=/dev/drbd1 directory=/repo fstype=gfs2
   Operations: start interval=0s timeout=60 (RepoFS-start-timeout-60)
               stop interval=0s timeout=60 (RepoFS-stop-timeout-60)
               monitor interval=10s (RepoFS-monitor-interval-10s)
 Clone: dlm-clone
  Resource: dlm (class=ocf provider=pacemaker type=controld)
   Operations: start interval=0s timeout=90 (dlm-start-timeout-90)
               stop interval=0s timeout=100 (dlm-stop-timeout-100)
               monitor interval=30s (dlm-monitor-interval-30s)

Stonith Devices:
Fencing Levels:

Location Constraints:
Ordering Constraints:
  start ClusterIP-clone then start Repo-clone (kind:Mandatory) (id:order-ClusterIP-Repo-mandatory)
  promote RepoDataClone then start RepoFS-clone (kind:Mandatory) (id:order-RepoDataClone-RepoFS-mandatory)
  start RepoFS-clone then start Repo-clone (kind:Mandatory) (id:order-RepoFS-Repo-mandatory)
  start dlm-clone then start RepoFS-clone (kind:Mandatory) (id:order-dlm-RepoFS-mandatory)
  start RepoFS then start UpdateService (kind:Mandatory) (id:order-RepoFS-UpdateService-mandatory)
Colocation Constraints:
  Repo-clone with ClusterIP-clone (score:INFINITY) (id:colocation-Repo-ClusterIP-INFINITY)
  RepoFS-clone with dlm-clone (score:INFINITY) (id:colocation-RepoFS-dlm-INFINITY)
  RepoFS-clone with RepoDataClone (score:INFINITY) (id:colocation-RepoFS-RepoDataClone-INFINITY)

Cluster Properties:
 cluster-infrastructure: cman
 dc-version: 1.1.12-1.1.12+git20140723.483f48a
 default-resource-stickiness: 0
 expected-quorum-votes: 2
 last-lrm-refresh: 1432927197
 no-quorum-policy: freeze
 stonith-enabled: false
I've been searching all over and I can't seem to figure it out. Any idea what is causing my issue or how to fix it?
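
In case it helps narrow things down, here is a rough sketch of how the state could be captured once while the IP works and once after it goes dead, so the two can be compared. The IP and NIC come from the ClusterIP resource above. The function names `snapshot` and `owns_bucket` are made up for illustration, and the sketch assumes IPaddr2 put its CLUSTERIP rule in the INPUT chain and that the kernel module exposes `/proc/net/ipt_CLUSTERIP/<ip>` on these nodes; neither is confirmed here.

```shell
#!/bin/sh
# Values taken from the ClusterIP resource in the config above.
CLUSTER_IP="${CLUSTER_IP:-10.140.14.1}"
NIC="${NIC:-eth0}"

# Hypothetical helper: succeed if node number $1 appears in the
# comma-separated bucket list $2 (the format assumed to be read back
# from /proc/net/ipt_CLUSTERIP/<ip>, e.g. "1,2").
owns_bucket() {
    case ",$2," in
        *",$1,"*) return 0 ;;
        *)        return 1 ;;
    esac
}

# Hypothetical helper: print a timestamped snapshot of the pieces that
# could plausibly differ between the working and the unreachable state.
# Needs root for iptables; errors are printed rather than hidden.
snapshot() {
    echo "== snapshot at $(date) =="
    echo "-- CLUSTERIP rule and packet counters (INPUT chain assumed) --"
    iptables -L INPUT -n -v 2>&1 | grep -i clusterip
    echo "-- hash buckets answered by this node --"
    cat "/proc/net/ipt_CLUSTERIP/$CLUSTER_IP" 2>&1
    echo "-- addresses on $NIC --"
    ip -o addr show dev "$NIC" 2>&1
    echo "-- neighbour (ARP) table --"
    ip neigh show 2>&1
}
```

After sourcing the file on each node, running `snapshot > working.txt` right after an iptables restart and `snapshot > broken.txt` once the IP is unreachable, then diffing the two files, should show whether the rule's packet counters, the bucket assignment, or the ARP entries changed. `owns_bucket 1 "$(cat /proc/net/ipt_CLUSTERIP/10.140.14.1)"` would check whether this node still claims bucket 1.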