
My Debian 8.9 DRBD 8.4.3 setup has somehow gotten into a state where the two nodes can no longer connect over the network. They should replicate a single resource r1, but immediately after drbdadm down r1; drbdadm up r1 on both nodes, their /proc/drbd describes the situation as follows:

On the 1st node (its connection state is either WFConnection or StandAlone):

1: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----
   ns:0 nr:0 dw:0 dr:912 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:20

On the 2nd node:

1: cs:StandAlone ro:Secondary/Unknown ds:UpToDate/DUnknown   r-----
   ns:0 nr:0 dw:0 dr:0 al:0 bm:0 lo:0 pe:0 ua:0 ap:0 ep:1 wo:f oos:48

The two nodes can ping each other over the IP addresses cited in /etc/drbd.d/r1.res, and netstat shows that both are listening on the cited port.
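
For reference, the connectivity checks were along these lines (10.0.0.2 and 7789 below are placeholders; the real values come from /etc/drbd.d/r1.res):

# on each node, ping the peer address cited in r1.res
ping -c 3 10.0.0.2
# confirm the local DRBD listener is bound to the cited port
netstat -tln | grep 7789    # or: ss -tln | grep 7789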

How can I (further diagnose and) get out of this situation so that the two nodes can become Connected and replicate over DRBD again?

BTW, at a higher level of abstraction this problem currently manifests itself as systemctl start drbd never exiting, apparently because it gets stuck in drbdadm wait-connect all (as suggested by /lib/systemd/system/drbd.service).
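
A quick way to confirm that the start job really is stuck there (a sketch; process names may vary with the drbd-utils version):

# the start job should show as "activating"
systemctl status drbd
# and the hung helper should be visible as a drbdadm wait-connect process
ps -ef | grep '[w]ait-connect'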

rookie09

2 Answers


The situation was apparently caused by a case of split-brain.

I had not noticed this because I had only inspected recent journal entries for drbd.service (sudo journalctl -u drbd); the problem had apparently been reported in the kernel log instead, and slightly earlier (sudo journalctl | grep Split-Brain).
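
In other words, the unit-scoped query misses the report, while a search of the kernel messages finds it (a minimal sketch, assuming journald captures kernel output, which is the default on Debian 8):

# unit-scoped journal: no split-brain message here
sudo journalctl -u drbd
# kernel messages: this is where the Split-Brain report shows up
sudo journalctl -k | grep -i split-brain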

With that, manually resolving the split-brain (as described here or here) also cleared up the troublesome situation, as follows.

On the split-brain victim (assuming the DRBD resource is r1):

drbdadm disconnect r1
drbdadm secondary r1
drbdadm connect --discard-my-data r1

On the split-brain survivor:

drbdadm primary r1
drbdadm connect r1
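
After the connect on both sides, the victim should resync from the survivor; a quick way to verify on DRBD 8.4:

# cs: should move through SyncSource/SyncTarget and end up Connected,
# ds: should end up UpToDate/UpToDate
cat /proc/drbd
drbdadm cstate r1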
rookie09
    It's best to include your steps in your answer versus linking to a site that might move later. I imagine you just needed `drbdadm disconnect r1` on both nodes, then `drbdadm connect r1 --discard-my-data` on the victim, and `drbdadm connect r1` on the survivor. – Matt Kereczman Aug 25 '17 at 14:44
  • @MattKereczman Done now. – rookie09 Aug 31 '17 at 06:11

I use the following pattern. On the sick node (which is not the current DC; run pcs status to check):

drbdadm dump all
drbdadm disconnect resource
drbdadm secondary resource
drbdadm connect resource

On the healthy node (which is the current DC; again check with pcs status):

drbdadm dump all
drbdadm disconnect resource
drbdadm primary resource
drbdadm connect resource
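
To identify which node is the current DC in the first place, the Pacemaker status output names it directly (assuming a pcs-managed cluster, as this answer does):

# the "Current DC:" line names the designated controller
pcs status | grep 'Current DC'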
sysadmin1138