One of my clients would like to run Consul in two datacenters. The two datacenters, Berlin and Frankfurt, should form a high-availability setup in which either datacenter can be taken offline or fail without affecting the other.

Both datacenters should hold the same data in a consistent state, and we have run into some problems.

Attempt 1: a cluster with a node in Berlin and a node in Frankfurt, common "datacenter" option

I would have expected a simple Consul cluster of two server nodes to work. But as soon as the connection between the two breaks, neither of them is able to function, since no leader can be elected anymore.
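For reference, the behaviour we observed matches majority-based leader election: a leader needs votes from more than half of the configured servers. Below is a minimal sketch of that arithmetic in Python. The function names are our own and purely illustrative; nothing here calls Consul.

```python
def quorum(servers: int) -> int:
    """Smallest majority of a peer set with `servers` voting members."""
    return servers // 2 + 1


def can_elect_leader(total_servers: int, reachable_servers: int) -> bool:
    """True if one side of a partition still holds a majority."""
    return reachable_servers >= quorum(total_servers)


def failures_tolerated(servers: int) -> int:
    """How many servers may be lost while a majority remains."""
    return servers - quorum(servers)


if __name__ == "__main__":
    # Two servers, one per datacenter: after a split each side sees only itself.
    print(quorum(2), can_elect_leader(2, 1), failures_tolerated(2))
```

With two servers the majority is two, so a split leaves each side one vote short and neither can elect a leader.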

Attempt 2: a cluster with a server node in Berlin and an agent in Frankfurt, common "datacenter" option

In this setup, when the split occurs, the server node in Berlin still works, but the agent node in Frankfurt stops working. This really raised the question of what the use of agents in Consul is anyway.

Attempt 3: two clusters with the "datacenter" option set to two different values

With one cluster / server in Berlin with the datacenter option set to "Berlin", and another cluster / server in Frankfurt with the datacenter option set to "Frankfurt", the storage is sharded. Both keep working after the split, but keys in Berlin cannot be accessed from Frankfurt and vice versa.

This requires the two clusters to be kept in sync somehow. One solution is a little daemon based on consul-replicate, but this adds a single point of failure to the whole setup, which we would like to avoid.

Attempt 4: Get another outside datacenter

The only thing we could think of is to run a third, outside node, in the hope that no more than one node ever fails at a time. This opens up a different box of things to think about (backup, security, updates, etc.), which we would like to avoid.


For now, our best option is to run two completely separate clusters and update the keys through an external pipeline. This might lead to inconsistencies, but works for our setup.
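To make the idea concrete, here is a rough sketch of the comparison step such a pipeline might perform. It assumes one cluster is treated as the authoritative source and that each cluster's keys have already been read into a plain dictionary. `plan_sync` and the sample keys are hypothetical, and no Consul API is called.

```python
from typing import Dict, List, Tuple


def plan_sync(
    source: Dict[str, str], target: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Compute the writes needed to make `target` match `source`.

    Returns (puts, deletes): keys to write with their new values,
    and keys to remove because they no longer exist in the source.
    """
    puts = {k: v for k, v in source.items() if target.get(k) != v}
    deletes = sorted(k for k in target if k not in source)
    return puts, deletes


if __name__ == "__main__":
    # Hypothetical snapshots of the two clusters' key/value stores.
    berlin = {"app/db_host": "db1", "app/mode": "live"}
    frankfurt = {"app/db_host": "db0", "app/old_flag": "1"}
    puts, deletes = plan_sync(berlin, frankfurt)
    print(puts)
    print(deletes)
```

Because the sync is one-way, anything written directly to the target cluster between runs is overwritten or deleted, which is exactly the kind of inconsistency we are accepting.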

However, this really left us wondering whether Consul really prefers a denial of service over a possible inconsistency.

Did we miss something? Is there another way to have an automatically replicated Consul cluster across two datacenters that still works in split-brain situations or when one datacenter is down?

Lars