
I have an HA cluster with two nodes; node 1 is the primary and node 2 is its mirror. I have a problem with the mysql resource, because my nodes are not synchronized.

drbd-overview

Primary node:
0:home Connected Primary/Secondary UpToDate/UpToDate C r-----
1:storage Connected Secondary/Primary UpToDate/UpToDate C r-----
2:mysql StandAlone Secondary/Unknown UpToDate/Outdated r-----

Secondary node:
0:home Connected Secondary/Primary UpToDate/UpToDate C r-----
1:storage Connected Primary/Secondary UpToDate/UpToDate C r-----
2:mysql StandAlone Primary/Unknown UpToDate/Outdated r-----

Reviewing the messages file, I found the following:

Apr-19 18:20:36 clsstd2 kernel: block drbd2:self C1480E287A8CAFAB:C7B94724E2658B94:5CAE57DEB3EDC4EE:F5887A918B55FB1A bits:114390101 flags:0
Apr-19 18:20:36 clsstd2 kernel: block drbd2:peer 719D326BDE8272E2:0000000000000000:C7BA4724E2658B94:C7B94724E2658B95 bits:0 flags:1 
Apr-19 18:20:36 clsstd2 kernel: block drbd2:uuid_compare()=-1000 by rule 100                           
Apr-19 18:20:37 clsstd2 kernel: block drbd2:Unrelated data, aborting!
Apr-19 18:20:37 clsstd2 kernel: block drbd2:conn (WFReportParams -> Disconnecting)
Apr-19 18:20:37 clsstd2 kernel: block drbd2:error receiving ReportState, l: 4!
Apr-19 18:20:38 clsstd2 kernel: block drbd2:asender terminated
Apr-19 18:20:38 clsstd2 kernel: block drbd2:Terminating asender thread
Apr-19 18:20:38 clsstd2 kernel: block drbd2:Connection closed
Apr-19 18:20:38 clsstd2 kernel: block drbd2:conn (Disconnecting -> StandAlone)
Apr-19 18:20:39 clsstd2 kernel: block drbd2:reciver terminated
Apr-19 18:20:39 clsstd2 kernel: block drbd2:Terminating reciver thread
Apr-19 18:20:39 clsstd2 auditd[3960]: Audit daemon rotating log files

I don't understand what the problem is or how I can solve it. Checking both nodes, I realized that the ibdata1 file does not exist in the /var/lib/mysql directory on node 2, but it does exist on node 1.


3 Answers


The problem is that you hit a DRBD split-brain condition and both nodes went into the “StandAlone” state. It’s difficult to say whether the DB on your primary node is valid or corrupted, but for now you have two routes to choose from:

(1) Try to resync the nodes, designating one of them as having the more recent version of the data (not necessarily your case).

(This is what you run on the second node, the one whose data will be discarded…)

# drbdadm secondary resource
# drbdadm disconnect resource
# drbdadm -- --discard-my-data connect resource

(This is what you run on the surviving node, the one you believe has the most recent version of the data…)

# drbdadm connect resource

If that doesn’t help, you can trash the second node’s data and force a full rebuild by running…

# drbdadm invalidate resource

(2) Purge the data on both nodes with the last command from (1) and recover your DB from backups.

Hope this helps!

P.S. I would really recommend avoiding DRBD in production. What you’re seeing is quite a common thing, unfortunately.

  • Right, this is a split brain in DRBD, and possibly there is a message like the following in the logs: "kernel: block drbd0: Split-Brain detected, dropping connection!" (although it's not always detected). Route 1 is worth trying. Just an example to illustrate: https://www.suse.com/support/kb/doc/?id=000019009. And you're right, DRBD is well known for this issue. To avoid it, either use quorum with a third node or go for something that works properly on two nodes, like StarWind vSAN for example. – Strepsils May 04 '23 at 07:47
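
To make the comment above concrete, here is a minimal, illustrative sketch of the two options it mentions. It assumes the resource is named mysql as in the question; the policy choices are examples to adapt rather than recommendations, and the quorum keywords require DRBD 9.

resource mysql {
  net {
    # automatic split-brain recovery policies (DRBD 8.4 and later)
    after-sb-0pri discard-zero-changes;   # neither node was primary: if only one node changed data, keep that node's data
    after-sb-1pri discard-secondary;      # one node was primary: discard the secondary's changes
    after-sb-2pri disconnect;             # both were primary: do not auto-resolve
  }
  options {
    # DRBD 9 only: refuse writes unless a majority of nodes is reachable
    # (needs a third node, which may be a diskless tiebreaker)
    quorum majority;
    on-no-quorum io-error;
  }
}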

The issue here is the "Unrelated data, aborting!" message you see in the logs. Likely the nodes have changed roles enough times, while disconnected, that the historical generation identifiers within the meta-data no longer match. See the DRBD User's Guide here for further information: https://linbit.com/drbd-user-guide/drbd-guide-9_0-en/#s-gi
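
If you want to check this before deciding which node to rebuild, you can print the generation identifiers on each node and compare them (a quick sketch, assuming DRBD 8.4-style tooling and the mysql resource from the question; the syntax differs slightly on DRBD 9):

# run on both nodes and compare the UUID tuples
drbdadm get-gi mysql     # compact list of current, bitmap and history UUIDs
drbdadm show-gi mysql    # the same information with explanatory text

If the two nodes share no UUID in any position, DRBD considers the data sets unrelated, which is exactly what the uuid_compare()=-1000 line in your log reports.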

At this point, you will need to select a node to overwrite the data of the other and perform a new full sync. To do this, you should recreate the meta-data on the node that is to become the SyncTarget. You can do this with drbdadm create-md <resource>.

  • Thank you for answering. When performing these steps, is the data on the main node not at risk? – Iván Jf Apr 28 '23 at 21:33
  • As long as you do not recreate the metadata on the primary node, it will automatically be chosen as the SyncSource once they connect. – Dok Apr 28 '23 at 23:19
  • Thanks, you were right, the solution was to recreate the metadata – Iván Jf May 08 '23 at 21:22

Thank you, indeed, the solution was to create the metadata again. I ran the following commands on the node where I wanted to recreate the metadata, and now everything is synchronized again.

drbdadm down resource    
drbdadm wipe-md resource    
drbdadm create-md resource    
drbdadm up resource    
drbdadm disconnect resource    
drbdadm connect resource

The last command is executed first on the node where the metadata was recreated, and then on the other node.

Finally, run cat /proc/drbd and you can watch the synchronization sequence.
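
For completeness, a few ways to watch the resync from either node (a small sketch, assuming the resource is the mysql one from the question):

watch -n1 cat /proc/drbd   # live progress of the resync
drbd-overview              # the summary view shown at the top of the question
drbdadm cstate mysql       # should report SyncTarget/SyncSource, then Connected
drbdadm dstate mysql       # should go from Inconsistent/UpToDate to UpToDate/UpToDate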
