I set up a two-node NFS cluster with Pacemaker, DRBD, and Corosync, and everything was working fine. While testing different failover scenarios, however, the cluster ended up completely broken: I can no longer fail over to the primary node, only the secondary one works, so when I stop the services on the secondary node my service goes down. I tried resyncing the disks and recreating the volume on the primary server, but Pacemaker stops my service group because the volume can't be mounted.
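
For reference, checking the replication state and forcing a resync by hand looks roughly like this (a sketch assuming DRBD 8.4-style drbdadm syntax and the res1 resource defined below):

    # Check the connection and disk state on both nodes
    cat /proc/drbd            # classic DRBD 8.x status (cs:, ro:, ds: fields)
    drbdadm status res1       # equivalent command on newer drbd-utils

    # Split-brain recovery: on the node whose data should be thrown away
    drbdadm disconnect res1
    drbdadm secondary res1
    drbdadm connect --discard-my-data res1

    # On the surviving node, reconnect if it is StandAlone
    drbdadm connect res1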

Here is my Corosync configuration:

    logging {
        debug: off
        to_syslog: yes
    }

    nodelist {
        node {
            name: nfs01-master
            nodeid: 1
            quorum_votes: 1
            ring0_addr: 10.x.x.150
        }
        node {
            name: nfs02-slave
            nodeid: 2
            quorum_votes: 1
            ring0_addr: 10.x.x.151
        }
    }

    quorum {
        provider: corosync_votequorum
    }

    totem {
        cluster_name: nfs-cluster-ha
        config_version: 3
        ip_version: ipv4
        secauth: on
        version: 2
        interface {
            bindnetaddr: 10.x.x.0
            ringnumber: 0
        }
    }
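
Membership and quorum can be double-checked with the standard Corosync tools, for example:

    corosync-cfgtool -s       # ring status of the local node
    corosync-quorumtool -s    # vote and quorum summary for the cluster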

Here is the DRBD configuration, which is the same on both nodes:

    resource res1 {

        startup {
            wfc-timeout 30;
            degr-wfc-timeout 15;
        }

        disk {
            on-io-error detach;
            no-disk-flushes;
            no-disk-barrier;
            c-plan-ahead 0;
            c-fill-target 24M;
            c-min-rate 80M;
            c-max-rate 720M;
        }

        net {
            max-buffers 36k;
            sndbuf-size 1024k;
            rcvbuf-size 2048k;
        }

        syncer {
            rate 1000M;
        }

        on nfs01-master {
            device /dev/drbd0;
            disk /dev/nfs01-master-vg/data;
            address 10.x.x.150:7788;
            meta-disk internal;
        }

        on nfs02-slave {
            device /dev/drbd0;
            disk /dev/nfs02-slave-vg/data;
            address 10.x.x.151:7788;
            meta-disk internal;
        }
    }

When a failover occurs, Pacemaker can't mount /dev/drbd0 on nfs01-master and the node stays stuck as Secondary. However, if I stop all the cluster services except DRBD and promote the node to Primary by hand, I am able to mount the partition.
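
Spelled out, that manual workaround is roughly the following, run on nfs01-master while the cluster resources are stopped (the /data mount point matches the fs_res1 primitive below):

    cat /proc/drbd            # the local disk must be UpToDate for promotion to succeed
    drbdadm primary res1      # promote by hand, which is what Pacemaker normally does
    mount /dev/drbd0 /data    # the same mount the fs_res1 resource performs

If this works by hand but not under Pacemaker, then the cluster itself is refusing the promotion (for example because of a constraint), rather than DRBD.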

Pacemaker is configured as follows:

    node 1: nfs01-master \
        attributes standby=off
    node 2: nfs02-slave
    primitive drbd_res1 ocf:linbit:drbd \
        params drbd_resource=res1 \
        op monitor interval=20s
    primitive fs_res1 Filesystem \
        params device="/dev/drbd0" directory="/data" fstype=ext4
    primitive nfs-common lsb:nfs-common
    primitive nfs-kernel-server lsb:nfs-kernel-server
    primitive virtual_ip_ens192 IPaddr2 \
        params ip=10.x.x.153 cidr_netmask=24 nic="ens192:1" \
        op start interval=0s timeout=60s \
        op monitor interval=5s timeout=20s \
        op stop interval=0s timeout=60s \
        meta failure-timeout=5s
    group services fs_res1 virtual_ip_ens192 nfs-kernel-server nfs-common
    ms ms_drbd_res1 drbd_res1 \
        meta master-max=1 master-node-max=1 clone-max=2 clone-node-max=1 notify=true
    location drbd-fence-by-handler-res1-ms_drbd_res1 ms_drbd_res1 \
        rule $role=Master -inf: #uname ne nfs02-slave
    location location_on_nfs01-master ms_drbd_res1 100: nfs01-master
    order services_after_drbd inf: ms_drbd_res1:promote services:start
    colocation services_on_drbd inf: services ms_drbd_res1:Master
    property cib-bootstrap-options: \
        have-watchdog=false \
        dc-version=2.0.1-9e909a5bdd \
        cluster-infrastructure=corosync \
        cluster-name=debian \
        stonith-enabled=false \
        no-quorum-policy=ignore \
        last-lrm-refresh=1577978640 \
        stop-all-resources=false
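
Note the drbd-fence-by-handler-res1-ms_drbd_res1 constraint above: a -inf score for the Master role on any node whose #uname is not nfs02-slave means the Master may only ever be promoted on nfs02-slave, which would explain nfs01-master being stuck as Secondary. Assuming crmsh is in use, a quick way to check for such a constraint:

    crm configure show | grep -B1 -A1 fence    # spot fence-by-handler constraints in the CIB
    crm_mon -1 -A                              # one-shot status including node attributes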

And my nfs-common settings:

    STATDOPTS="-n 10.x.x.153 --port 32765 --outgoing-port 32766"
    NEED_IDMAPD=yes
    NEED_GSSD=no

As I said, as long as the secondary is alive the service works fine, but failover doesn't behave as it should: normally, while the primary is alive it should have priority for the services, and the cluster should switch to the secondary only if the primary goes down.

On my primary node the cluster status shows:

    Stack: corosync
    Current DC: nfs01-master (version 2.0.1-9e909a5bdd) - partition WITHOUT quorum
    Last updated: Thu Jan  9 15:21:03 2020
    Last change: Thu Jan  9 11:58:28 2020 by root via cibadmin on nfs02-slave

    2 nodes configured
    6 resources configured

    Online: [ nfs01-master ]
    OFFLINE: [ nfs02-slave ]

    Full list of resources:

     Resource Group: services
         fs_res1    (ocf::heartbeat:Filesystem):    Stopped
         virtual_ip_ens192  (ocf::heartbeat:IPaddr2):   Stopped
         nfs-kernel-server  (lsb:nfs-kernel-server):    Stopped
         nfs-common (lsb:nfs-common):   Stopped
     Clone Set: ms_drbd_res1 [drbd_res1] (promotable)
         Slaves: [ nfs01-master ]
         Stopped: [ nfs02-slave ]

If anyone can help, I would appreciate it.

Thank you.


1 Answer


I was able to solve my problem: there was a fencing constraint in my configuration, and once I removed it the cluster switches over correctly when a node is lost.

I removed these lines:

    location drbd-fence-by-handler-res1-ms_drbd_res1 ms_drbd_res1 \
        rule $role=Master -inf: #uname ne nfs02-slave
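
Assuming crmsh, removing the constraint and clearing the old failure history can be done roughly like this:

    crm configure delete drbd-fence-by-handler-res1-ms_drbd_res1
    crm resource cleanup ms_drbd_res1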

I still have to read up on fencing, so that the cluster does not fail back to the master while the disks are out of sync and the slave has been writing to them.
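
For what it's worth, a constraint with that name is what DRBD's crm-fence-peer.sh handler normally creates while the peer's data is outdated, and crm-unfence-peer.sh is supposed to remove it again once the resync finishes. A typical DRBD 8.4-style resource-level fencing setup looks roughly like this (a sketch; the script paths come with drbd-utils and may differ on your system):

    resource res1 {
        disk {
            fencing resource-only;
        }
        handlers {
            fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
            after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
        }
        # ... rest of the resource definition as above ...
    }

With both handlers in place the constraint is added and removed automatically, so it should not permanently block promotion on the other node.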
