I set up a two-node NFS cluster with Pacemaker, Corosync and DRBD, and everything was working fine. But after testing different failover scenarios, my cluster is completely broken: I can no longer switch to the primary node, only the secondary one works, so when I stop the services on the secondary node my service is down. I tried to resync the disks and recreate the volume on the primary server, but Pacemaker stops my service group because the volume can't be mounted.
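For reference, the resync attempts on nfs01-master went roughly like this (DRBD 8.x tooling assumed here, which is what the syncer section in my config suggests):

cat /proc/drbd                # connection and disk state of the resource
drbdadm invalidate res1       # force a full resync of the local disk from the peer
# after recreating the backing volume:
drbdadm create-md res1        # re-initialise the DRBD metadata
drbdadm up res1               # attach and reconnect the resource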
Here is my Corosync configuration:
logging {
  debug: off
  to_syslog: yes
}
nodelist {
  node {
    name: nfs01-master
    nodeid: 1
    quorum_votes: 1
    ring0_addr: 10.x.x.150
  }
  node {
    name: nfs02-slave
    nodeid: 2
    quorum_votes: 1
    ring0_addr: 10.x.x.151
  }
}
quorum {
  provider: corosync_votequorum
}
totem {
  cluster_name: nfs-cluster-ha
  config_version: 3
  ip_version: ipv4
  secauth: on
  version: 2
  interface {
    bindnetaddr: 10.x.x.0
    ringnumber: 0
  }
}
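To rule out a Corosync membership problem, I check the ring and the vote status on each node with the standard tools:

corosync-cfgtool -s       # ring status of the local node
corosync-quorumtool -s    # quorum and membership information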
DRBD configuration on both nodes
resource res1 {
  startup {
    wfc-timeout 30;
    degr-wfc-timeout 15;
  }
  disk {
    on-io-error detach;
    no-disk-flushes;
    no-disk-barrier;
    c-plan-ahead 0;
    c-fill-target 24M;
    c-min-rate 80M;
    c-max-rate 720M;
  }
  net {
    max-buffers 36k;
    sndbuf-size 1024k;
    rcvbuf-size 2048k;
  }
  syncer {
    rate 1000M;
  }
  on nfs01-master {
    device /dev/drbd0;
    disk /dev/nfs01-master-vg/data;
    address 10.x.x.150:7788;
    meta-disk internal;
  }
  on nfs02-slave {
    device /dev/drbd0;
    disk /dev/nfs02-slave-vg/data;
    address 10.x.x.151:7788;
    meta-disk internal;
  }
}
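To make sure both nodes agree on this resource, I compare the parsed configuration and the DRBD state on each of them, roughly like this:

drbdadm dump res1      # the configuration as drbdadm parses it
drbdadm role res1      # Primary/Secondary role of the local node
drbdadm cstate res1    # should be Connected
drbdadm dstate res1    # should be UpToDate/UpToDate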
When a failover occurs, Pacemaker can't mount /dev/drbd0 on nfs01-master and the node stays stuck as Secondary. But when I stop all the services other than DRBD and promote the node to Primary by hand, I am able to mount the partition.
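The manual workaround that does work is roughly this, run on nfs01-master while the Pacemaker-managed services are stopped (resource and mount point names are the ones from my configuration):

drbdadm primary res1       # promote the node by hand
mount /dev/drbd0 /data     # this mount succeeds when done manually
# to hand the volume back afterwards:
umount /data
drbdadm secondary res1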
Pacemaker is configured as follows:
node 1: nfs01-master \
        attributes standby=off
node 2: nfs02-slave
primitive drbd_res1 ocf:linbit:drbd \
        params drbd_resource=res1 \
        op monitor interval=20s
primitive fs_res1 Filesystem \
        params device="/dev/drbd0" directory="/data" fstype=ext4
primitive nfs-common lsb:nfs-common
primitive nfs-kernel-server lsb:nfs-kernel-server
primitive virtual_ip_ens192 IPaddr2 \
        params ip=10.x.x.153 cidr_netmask=24 nic="ens192:1" \
        op start interval=0s timeout=60s \
        op monitor interval=5s timeout=20s \
        op stop interval=0s timeout=60s \
        meta failure-timeout=5s
group services fs_res1 virtual_ip_ens192 nfs-kernel-server nfs-common
ms ms_drbd_res1 drbd_res1 \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
location drbd-fence-by-handler-res1-ms_drbd_res1 ms_drbd_res1 \
        rule $role=Master -inf: #uname ne nfs02-slave
location location_on_nfs01-master ms_drbd_res1 100: nfs01-master
order services_after_drbd inf: ms_drbd_res1:promote services:start
colocation services_on_drbd inf: services ms_drbd_res1:Master
property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=2.0.1-9e909a5bdd \
        cluster-infrastructure=corosync \
        cluster-name=debian \
        stonith-enabled=false \
        no-quorum-policy=ignore \
        last-lrm-refresh=1577978640 \
        stop-all-resources=false
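When the failover gets stuck, the commands I use to look at the constraints and the failed actions are roughly these (crmsh, matching the configuration dump above):

crm_mon -1 -rf                            # one-shot status including fail counts
crm configure show | grep -A1 location    # list the location constraints, including the fence-by-handler one
crm resource cleanup fs_res1              # clear the failed mount so Pacemaker retries it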
And here is my nfs-common configuration:
STATDOPTS="-n 10.x.x.153 --port 32765 --outgoing-port 32766"
NEED_IDMAPD=yes
NEED_GSSD=no
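To confirm that the fixed statd port from this file is actually applied, I check with rpcinfo after restarting nfs-common:

rpcinfo -p | grep status    # statd should be registered on port 32765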
As I said, as long as the secondary node is alive the service works fine, but failover doesn't behave the way it should: normally, while the primary is alive it should have priority for the services, and they should switch to the secondary only when the primary goes down.
On my primary node, the cluster status shows:
Stack: corosync
Current DC: nfs01-master (version 2.0.1-9e909a5bdd) - partition WITHOUT quorum
Last updated: Thu Jan 9 15:21:03 2020
Last change: Thu Jan 9 11:58:28 2020 by root via cibadmin on nfs02-slave
2 nodes configured
6 resources configured
Online: [ nfs01-master ]
OFFLINE: [ nfs02-slave ]
Full list of resources:
 Resource Group: services
     fs_res1            (ocf::heartbeat:Filesystem):    Stopped
     virtual_ip_ens192  (ocf::heartbeat:IPaddr2):       Stopped
     nfs-kernel-server  (lsb:nfs-kernel-server):        Stopped
     nfs-common         (lsb:nfs-common):               Stopped
 Clone Set: ms_drbd_res1 [drbd_res1] (promotable)
     Slaves: [ nfs01-master ]
     Stopped: [ nfs02-slave ]
If anyone can help, I would appreciate it.
Thank you.