I am building an app in Go that I would like to be fault-tolerant. I looked at algorithms such as Raft and Paxos and their Go implementations (etcd's raft, hashicorp's raft), but I feel they might be overkill for my specific use case.

In my application, the nodes just wait on standby and act as failovers in case the leader fails. I do not need to replicate any state across the cluster. All I need are the following properties:

If a node is a leader:

  • Run a given piece of code

If a node is not a leader:

  • Wait for a leader to fail
  • Elect a new leader once the existing leader fails

Any suggestions?

Aibek
  • Riding on the raft of etcd is a good idea. I think you can look at [leases](https://etcd.io/docs/v3.3.12/dev-guide/interacting_v3/#grant-leases). – George Leung Apr 20 '20 at 07:09
  • It is appealing, but I still have a problem. There is no "client" that would make requests to this application, so I am not really trying to replicate any logs across the instances of the application (and it seems to me that Raft was built with those assumptions in mind). All I am trying to do is ensure that only one instance is running at a time, and that if that instance fails, another instance will take over. – Aibek Apr 21 '20 at 01:09

2 Answers

Since you want a leader election protocol, it sounds like you want to avoid having more than one node acting as the leader at once. The answer really depends on how strictly you require this property. In some cases it is acceptable to occasionally have more than one node acting as the leader; perhaps the worst that happens is a bit of duplicated work. In other cases the whole system may operate incorrectly if there are ever duplicate leaders, so you must be much more careful.

If you can accept occasional cases of duplicate leaders then a simpler protocol may be for you. However, if you absolutely cannot tolerate having more than one leader at once then you will have to combine your leader election protocol with some kind of replication of state, and a proven implementation of Paxos or Raft or similar is a very good way to do this. There are many subtly different protocols for this, but they are all basically doing the same thing.

The fundamental problem here is pinning down what "at once" means in a realistic network in which messages may sometimes be delivered after a very long delay. Typically one assumes that the network is completely asynchronous with no time bounds on delivery, and indeed Paxos, Raft etc. are all designed to work correctly under that assumption. These algorithms work around this by defining their own internal notion of time (ballots in Paxos, terms in Raft) and attaching this "internal time" to all state transitions under their control. This gives some very strong guarantees and, in particular, ensures that no two nodes may take actions as leader at the same "internal time".
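As a rough illustration of how attaching a term to every action rules out stale leaders, the sketch below has a resource that remembers the highest term it has seen and refuses anything older. `Store`, `Write` and `ErrStaleTerm` are hypothetical names made up for this example. In Paxos or Raft the equivalent check is performed by the replicated state itself; here a single in-memory value stands in for it, so this shows the idea rather than a fault-tolerant implementation.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrStaleTerm is returned when a write carries a term older than one
// the store has already accepted.
var ErrStaleTerm = errors.New("stale term")

// Store is a hypothetical resource that only accepts writes from the
// newest term it has seen. It is not safe for concurrent use.
type Store struct {
	highestTerm uint64
	value       string
}

// Write applies value if term is at least the highest term seen so far,
// and rejects it otherwise.
func (s *Store) Write(term uint64, value string) error {
	if term < s.highestTerm {
		return ErrStaleTerm
	}
	s.highestTerm = term
	s.value = value
	return nil
}

func main() {
	s := &Store{}

	fmt.Println(s.Write(1, "from leader of term 1"))         // <nil>
	fmt.Println(s.Write(2, "from leader of term 2"))         // <nil>
	fmt.Println(s.Write(1, "delayed write from old leader")) // stale term
	fmt.Println(s.value)                                     // from leader of term 2
}
```

The old leader's message may arrive arbitrarily late, but because it carries term 1 it cannot overwrite anything done in term 2.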

If you don't replicate any state via something like Paxos or Raft then you won't be able to make use of this strong notion of internal time.

Dave Turner

If you will be deploying your application in a Kubernetes cluster, you can use the Kubernetes Go client library for this use case: https://github.com/kubernetes-client/go

coder