I've been working with DRBD for about a year now, and at this point I'm tearing my hair out in frustration. Every time there is a network fault (disappointingly common in the environment I'm working in), a critical pair of servers split-brains and I have to intervene manually. For some background, these servers are in a master-slave configuration, and they perform hashing operations on files before distributing them to other servers around the world. They receive new files every 2-5 minutes, and the two must always be in sync so that, should service fail over, the other server is not serving stale data. This server pair isn't in production yet, but it's frustrating that every network issue leaves stale data on one node.
How can I make DRBD not split-brain every time there is a network issue? Failing that, how can I automate recovery? Here is the config for my DRBD resources. I have it controlled by a cman stack.
    resource foo {
        handlers {
            split-brain "/usr/local/bin/notify-split-brain.sh root";
        }
        protocol C;
        meta-disk internal;
        device /dev/drbd0;
        net {
            after-sb-0pri discard-younger-primary;
            after-sb-1pri discard-secondary;
            after-sb-2pri disconnect;
        }
        on nodea {
            disk /dev/sdb;
            address x.x.x.1:7789;
        }
        on nodeb {
            disk /dev/sdb;
            address x.x.x.2:7789;
        }
    }
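For anyone reading the `net` section above: DRBD chooses between the three `after-sb-*` settings by how many nodes are in the Primary role at the moment the split brain is detected. The Python below is only a rough model of that selection, not DRBD code; the function name `select_policy` and the role strings are made up for illustration, and the policy values are copied from the config above.

```python
# Illustrative model only; this is not how DRBD is implemented.
# Maps "number of Primary nodes at detection time" to the setting
# and value taken from the net section of the config above.
POLICIES = {
    0: ("after-sb-0pri", "discard-younger-primary"),
    1: ("after-sb-1pri", "discard-secondary"),
    2: ("after-sb-2pri", "disconnect"),
}


def select_policy(roles):
    """Return (setting, value) for a two-node resource.

    `roles` is a hypothetical list of two role strings, each either
    "Primary" or "Secondary", as seen when the split brain is detected.
    """
    if len(roles) != 2:
        raise ValueError("this sketch only models a two-node resource")
    primaries = sum(1 for role in roles if role == "Primary")
    return POLICIES[primaries]


if __name__ == "__main__":
    for roles in (["Secondary", "Secondary"],
                  ["Primary", "Secondary"],
                  ["Primary", "Primary"]):
        setting, value = select_policy(roles)
        print(roles, "->", setting, value)
```

With this config, only the case where both nodes are Primary falls through to `disconnect`, which leaves the resource unconnected until someone intervenes.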
This is running on CentOS Linux release 7.2.1511 (Core).