
I am trying to understand the recovery process of a promotable resource after "pcs cluster stop --all" and a shutdown of both nodes. I have a two-node + qdevice quorum setup with a DRBD resource.

This is a summary of the resources before my test. Everything is working just fine, and server2 is the DRBD master.

 * fence-server1    (stonith:fence_vmware_rest):     Started server2
 * fence-server2    (stonith:fence_vmware_rest):     Started server1
 * Clone Set: DRBDData-clone [DRBDData] (promotable):
   * Masters: [ server2 ]
   * Slaves: [ server1 ]
 * Resource Group: nfs:
   * drbd_fs    (ocf::heartbeat:Filesystem):     Started server2

Then I issue "pcs cluster stop --all". The cluster is stopped on both nodes as expected. Now I restart server1 (previously the slave) and power off server2 (previously the master). When server1 restarts it fences server2, and I can see server2 starting in vCenter, but I press a key at the GRUB menu so that server2 does not actually boot and instead just sits "paused" at the GRUB screen.

SSH'ing into server1 and running "pcs status", I get:

Cluster name: cluster1
Cluster Summary:
  * Stack: corosync
  * Current DC: server1 (version 2.1.0-8.el8-7c3f660707) - partition with quorum
  * Last updated: Mon May  2 09:52:03 2022
  * Last change:  Mon May  2 09:39:22 2022 by root via cibadmin on server1
  * 2 nodes configured
  * 11 resource instances configured

Node List:
  * Online: [ server1 ]
  * OFFLINE: [ server2 ]

Full List of Resources:
  * fence-server1    (stonith:fence_vmware_rest):     Stopped
  * fence-server2    (stonith:fence_vmware_rest):     Started server1
  * Clone Set: DRBDData-clone [DRBDData] (promotable):
    * Slaves: [ server1 ]
    * Stopped: [ server2 ]
  * Resource Group: nfs:
    * drbd_fs    (ocf::heartbeat:Filesystem):     Stopped

Here are the constraints:

# pcs constraint
Location Constraints:
  Resource: fence-server1
    Disabled on:
      Node: server1 (score:-INFINITY)
  Resource: fence-server2
    Disabled on:
      Node: server2 (score:-INFINITY)
Ordering Constraints:
  promote DRBDData-clone then start nfs (kind:Mandatory)
Colocation Constraints:
  nfs with DRBDData-clone (score:INFINITY) (rsc-role:Started) (with-rsc-role:Master)
Ticket Constraints:

# sudo crm_mon -1A
...
Node Attributes:
  * Node: server2:
    * master-DRBDData                     : 10000

So I can see there is quorum, but server1 is never promoted to DRBD master, so the remaining resources stay stopped until server2 is back.
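In case it helps, the promotion score and DRBD state on server1 can be inspected with commands along these lines (the DRBD resource name "drbd0" is only a placeholder for whatever is defined in the DRBD configuration, and "drbdadm status" assumes DRBD 9):

# crm_attribute --node server1 --name master-DRBDData --lifetime reboot --query
# drbdadm status drbd0
# crm_simulate --live-check --show-scores

The first command queries the transient master-DRBDData attribute for server1 (the same attribute crm_mon shows above for server2), and the last one shows the scores Pacemaker is using for placement and promotion decisions.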

  1. What do I need to do to force the promotion and recover without restarting server2?
  2. Why can the cluster recover by itself if I instead reboot server2 and power off server1?
  3. Does that mean that, for some reason, the DRBD data got out of sync during the "cluster stop --all"?
Jose

1 Answer


I ran into the exact same issue with my setup, which is almost a carbon copy of yours, and I eventually managed to make it work. (I was testing what happens if there is a power outage, all servers in the cluster turn off, and only one storage node comes back.)

Not sure of your setup - I have a diskless witness for DRBD with a quorum setting of 1; the witness is also used as a qdevice for the cluster. I checked the status of the DRBD resource on the available node - it was Secondary, with Connecting on the downed node, and Diskless (Connected/Secondary) on the witness node. I checked the status of the cluster quorum and made sure it was quorate.
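For reference, those checks can be done with commands of this sort (the DRBD resource name "drbd0" is just a placeholder for your own resource name):

# drbdadm status drbd0
# pcs quorum status
# corosync-quorumtool -s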

After that, I made the DRBD resource primary on the available node. I eventually figured out that if I (temporarily) disabled STONITH on the cluster, the DRBD resource and the subsequent resources started immediately and in order. After 'fixing' the downed node, I re-enabled STONITH and made sure the resources could move around properly.
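Roughly, that sequence translates into commands like the following; treat it as a sketch rather than a guaranteed recipe, and note that "drbd0" is a placeholder for the actual DRBD resource name:

# drbdadm primary drbd0
# pcs property set stonith-enabled=false
# pcs status
# pcs property set stonith-enabled=true

The idea, as I understand it, is that with STONITH temporarily out of the picture Pacemaker stops waiting on the unreachable node before promoting DRBD and starting the dependent resources; re-enabling stonith-enabled afterwards restores normal fencing behaviour.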

ty9000