I inherited this setup, and it is really old (running DRBD 8.3). I tried `drbdadm connect drbd0` and `drbdadm primary -f drbd0`, but both come back with "Need access to UpToDate data".

I presume that is because drbd0 is Inconsistent on both nodes.

[root@node-01 ~]# drbd-overview
  0:drbd0  StandAlone Secondary/Unknown   Inconsistent/Outdated r-----
  1:drbd1  Connected  Secondary/Secondary UpToDate/UpToDate     C      r-----

[root@node-02 ~]# drbd-overview
  0:drbd0  WFConnection Secondary/Unknown   Inconsistent/DUnknown C r-----
  1:drbd1  Connected    Secondary/Secondary UpToDate/UpToDate     C r-----

How can I fix this, without nuking the data on it?

When I did drbdadm connect drbd0 the system log says:

block drbd0: conn( StandAlone -> Unconnected )
block drbd0: Starting receiver thread (from drbd0_worker [6860])
block drbd0: receiver (re)started
block drbd0: conn( Unconnected -> WFConnection )
block drbd0: Handshake successful: Agreed network protocol version 96
block drbd0: conn( WFConnection -> WFReportParams )
block drbd0: Starting asender thread (from drbd0_receiver [21821])
block drbd0: data-integrity-alg: <not-used>
block drbd0: drbd_sync_handshake:
block drbd0: self AA586D9040BXXXX:7DF55F42BF95XXXX:7DF45F42BF95XXXX:DC31D449C727XXXX bits:416 flags:0
block drbd0: peer 7DF55F42BF9XXXX:0000000000000000:DC31D449C727EE27:DC30D449C727XXXX bits:416 flags:0
block drbd0: uuid_compare()=1 by rule 70
block drbd0: I shall become SyncSource, but I am inconsistent!
block drbd0: conn( WFReportParams -> Disconnecting )
block drbd0: error receiving ReportState, l: 4!
block drbd0: asender terminated
block drbd0: Terminating asender thread
block drbd0: Connection closed
block drbd0: conn( Disconnecting -> StandAlone )
block drbd0: receiver terminated
block drbd0: Terminating receiver thread
raarts
  • Node 01 is currently "StandAlone". First try to re-establish the connection with 'drbdadm connect drbd0', then check the logs for additional clues. – Dok Oct 02 '18 at 15:59
  • @Dok: I included the syslog output – raarts Oct 02 '18 at 16:05
  • @raarts: I saw you join #drbd on Freenode and ask this question. I would have answered you there if you had stuck around (I'm on PDT) ;) Either way, my answer below _should_ be what you're looking for. If it doesn't work for some reason, let me know in a comment and I'll help you through it. – Matt Kereczman Oct 04 '18 at 18:21

1 Answer

Neither node has UpToDate data, so DRBD will not be able to go Primary without some convincing. You'll need to force a node into Primary.
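To confirm that neither side has UpToDate data before forcing anything, you can look at the disk states directly. The following is only a rough sketch: `has_uptodate` is a hypothetical helper name, and it assumes the disk state comes as a `local/peer` string like the one in the `drbd-overview` output (for example from `drbdadm dstate drbd0`).

```shell
#!/bin/sh
# Hypothetical helper: succeeds if a "local/peer" disk-state string,
# e.g. "Inconsistent/Outdated", reports UpToDate on either side.
has_uptodate() {
    case "$1" in
        UpToDate/*|*/UpToDate) return 0 ;;
        *) return 1 ;;
    esac
}

# Usage on a live node (assumes the resource is named drbd0):
#   if has_uptodate "$(drbdadm dstate drbd0)"; then
#       echo "UpToDate data is reachable"
#   else
#       echo "no UpToDate data; a forced promotion would be needed"
#   fi
```

With the states from the question (`Inconsistent/Outdated` and `Inconsistent/DUnknown`), the check fails on both nodes, which matches the error message.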

Whichever node you run the following command on should become the SyncSource, so be sure you choose the node you believe to have good data.

drbdadm -- --overwrite-data-of-peer primary <resource>

If you're not sure, I would disconnect the resource on both nodes so they're both StandAlone, run the above command on one node, promote that node to Primary, and then inspect the data. Then repeat on the other node. Once you know where the good data is, demote both sides and resolve the split-brain in the correct direction: tell the split-brain victim to discard its data with `drbdadm -- --discard-my-data connect <resource>`, and simply connect the split-brain survivor with `drbdadm connect <resource>`.
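The steps above could be wrapped up roughly like this. It is a sketch, not a tested recovery script: the function names and the `RUN` switch are made up for illustration, and by default it only prints the `drbdadm` commands so you can read them before running anything against real data.

```shell
#!/bin/sh
# Sketch only. Prints each command unless RUN=1 is set in the environment.
run() {
    if [ "${RUN:-0}" = "1" ]; then
        "$@"
    else
        printf '%s\n' "$*"
    fi
}

# Inspection step, one node at a time: go StandAlone, then force Primary
# so the data can be mounted and checked.
inspect_node() {
    run drbdadm disconnect "$1"
    run drbdadm -- --overwrite-data-of-peer primary "$1"
}

# On the split-brain victim (the node whose data gets discarded).
resolve_victim() {
    run drbdadm secondary "$1"
    run drbdadm -- --discard-my-data connect "$1"
}

# On the split-brain survivor (the node whose data is kept).
resolve_survivor() {
    run drbdadm secondary "$1"
    run drbdadm connect "$1"
}
```

For example, `resolve_victim drbd0` on the bad node and `resolve_survivor drbd0` on the good one print the commands for the final step; rerun with `RUN=1` once you are sure which node is which.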

Hope that helps!

Matt Kereczman
  • Matt, thanks, I did, and my Xen cluster (which runs on this drbd) came back online, but I now get the following on node1: `0:drbd0 Connected Primary/Secondary UpToDate/Inconsistent C r-----` and this on node2: `0:drbd0 Connected Secondary/Primary Inconsistent/UpToDate`. This means it's still not syncing? – raarts Oct 05 '18 at 09:23
  • @raarts: On the `Secondary` node run the following: `drbdadm disconnect drbd0 && drbdadm connect drbd0`. Is this version of DRBD older than 8.3.8.1? There was a race condition fixed in 8.3.8.1 that caused similar symptoms. – Matt Kereczman Oct 05 '18 at 16:33
  • I ran it, and it started syncing (done in 20 seconds: `[=========>..........] sync'ed: 53.1% (435124/923908)K`; the two machines are connected over 100Mb), but it ended up in the same state. Syslog said: `Resync done (total 43 sec; paused 0 sec; 21484 K/sec) 640 failed blocks`. Am I in trouble?  – raarts Oct 05 '18 at 19:29
  • Ok, I found the problem. Disk I/O errors on the first node. Thanks for helping! – raarts Oct 06 '18 at 08:29
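
For anyone hitting the same symptom: the "failed blocks" count in the resync message was the hint here. A small sketch for pulling that number out of the log, assuming the message has the form quoted in the comment above (`failed_blocks` is a hypothetical helper name, and the log file path depends on your distribution):

```shell
#!/bin/sh
# Hypothetical helper: reads log lines on stdin and prints the failed-block
# count from every "Resync done (...) N failed blocks" message.
failed_blocks() {
    sed -n 's/.*Resync done (.*) \([0-9][0-9]*\) failed blocks.*/\1/p'
}

# Usage (log path is an assumption):
#   failed_blocks < /var/log/messages
```

A non-zero count after a resync is a reason to check the backing disks for I/O errors, as turned out to be the case in this thread.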