Adding network-fault tolerance to drbd or drbd alternatives

Question

I've been working with drbd for about a year now, and at this point I'm tearing my hair out in frustration. Every time there is a network fault (disappointingly common in the environment I'm working in), a critical pair of servers split-brains and I have to intervene manually. For some background, these servers are in a master-slave configuration, and they perform hashing operations on files before distributing them to other servers around the world. They receive new files every 2-5 minutes, and the two must always be in sync so that, should the service fail over, the other server is not serving stale data. This server pair isn't in production yet, but it's still frustrating, because every network issue leaves stale data on one node.

How can I keep drbd from split-braining every time there is a network issue, or failing that, automate the recovery? Here is the config for my drbd resource, which is controlled by a cman stack.

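resource foo {
    handlers {
        # Called when drbd detects a split brain; mails the given user.
        split-brain "/usr/local/bin/notify-split-brain.sh root";
    }
    protocol C;
    meta-disk internal;
    device /dev/drbd0;
    net {
        # Neither node primary: discard the node that became primary last.
        after-sb-0pri discard-younger-primary;
        # One node primary: discard the secondary's changes.
        after-sb-1pri discard-secondary;
        # Both nodes primary: no automatic recovery, just disconnect.
        after-sb-2pri disconnect;
    }
    on nodea {
        disk /dev/sdb;
        address x.x.x.1:7789;
    }
    on nodeb {
        disk /dev/sdb;
        address x.x.x.2:7789;
    }
}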

This is running on CentOS Linux release 7.2.1511 (Core).
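
For reference, here is a minimal sketch of how the disconnected state could be detected from a script rather than by hand. It assumes the DRBD 8.x `/proc/drbd` layout, where each device line looks like ` 0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----`. The function names are hypothetical and not part of drbd. A `StandAlone` or `WFConnection` state is only a hint: it also shows up after a manual disconnect or while the peer is simply unreachable, so this is not a definitive split-brain test.

```python
import re

# Matches a DRBD 8.x /proc/drbd device line (format is an assumption), e.g.
#  0: cs:Connected ro:Primary/Secondary ds:UpToDate/UpToDate C r-----
STATUS_RE = re.compile(
    r"^[ \t]*(?P<minor>\d+):[ \t]+cs:(?P<cs>\S+)"
    r"(?:[ \t]+ro:(?P<ro>\S+))?"
    r"(?:[ \t]+ds:(?P<ds>\S+))?",
    re.MULTILINE,
)

# Connection states treated as "not talking to the peer" in this sketch.
DISCONNECTED_STATES = {"StandAlone", "WFConnection"}


def parse_proc_drbd(text):
    """Return {minor: {"cs": ..., "ro": ..., "ds": ...}} for each device line."""
    devices = {}
    for match in STATUS_RE.finditer(text):
        devices[int(match.group("minor"))] = {
            "cs": match.group("cs"),
            "ro": match.group("ro"),  # None if the line has no ro: field
            "ds": match.group("ds"),  # None if the line has no ds: field
        }
    return devices


def disconnected_minors(text):
    """Return the sorted minor numbers whose connection state is disconnected."""
    devices = parse_proc_drbd(text)
    return sorted(
        minor
        for minor, state in devices.items()
        if state["cs"] in DISCONNECTED_STATES
    )
```

Something like this could be fed the contents of `/proc/drbd` from cron and used to raise an alert, separately from the `split-brain` handler in the config above.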

Any chance to get a more reliable network? Multiple connections for example? — gxx, Jan 26 '16 at 17:20
These are running on VMs that sit on a trunked VLAN to the hosts, so anything that affects the hosts will affect everything living on them. Assume nothing can be done for the network. — Oblivious12, Jan 26 '16 at 18:15
