
I am reading about distributed systems and getting confused between Quorum, Consensus, and Vector Clock.

Can someone please explain them with examples?

Kumar
  • a bit hard to explain all this here. Don't you have a more precise question? – OznOg Aug 16 '22 at 16:57
  • @OznOg: I am getting confused: if the system has a strong read/write quorum, then other nodes should just replicate the same value... so why do we require a Raft/Paxos kind of algorithm? – Kumar Aug 16 '22 at 17:00

2 Answers


Let's also add Version Vector to your questions :)

There are various problems to tackle in distributed systems, and there are different tools to solve those challenges.

Problem 1: I'd like to make a decision involving a specific number of nodes. We will call that number a quorum. For example, in leaderless replication based on Dynamo, a quorum is the number of nodes that represents a majority.

To be clear, a quorum does not have to be a majority; it all depends on the specifics of the problem. For example, you could say that in system X a quorum is the set of the three oldest nodes.
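
To make that concrete, here is a minimal Python sketch (node names are made up) showing that a quorum is just a rule over which subsets of responding nodes are "enough"; a majority quorum and the "three oldest nodes" quorum are simply different predicates:

```python
# A quorum is just a rule over which subsets of nodes are "enough".
# Two example rules (node names are made up):

def majority_quorum_size(n: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return n // 2 + 1

def is_majority_quorum(acks: set[str], n: int) -> bool:
    return len(acks) >= majority_quorum_size(n)

def is_three_oldest_quorum(acks: set[str], nodes_oldest_first: list[str]) -> bool:
    """A non-majority rule: the three oldest nodes must all have responded."""
    return set(nodes_oldest_first[:3]).issubset(acks)

nodes = ["n1", "n2", "n3", "n4", "n5"]                        # oldest first
print(is_majority_quorum({"n1", "n2", "n3"}, len(nodes)))     # True: 3 of 5
print(is_three_oldest_quorum({"n1", "n2", "n5"}, nodes))      # False: n3 is missing
```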

Problem 2: We have multiple nodes and we want them all to agree on something; we want the nodes to reach a Consensus on a specific decision. E.g. there are 10 numbers (0..9) and 100 nodes, and we want them all to pick the same number. So consensus is the general idea of agreement on something. Common algorithms are Paxos, Raft, etc.
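
Paxos, Raft, Zab, etc. all implement the same abstract contract. The sketch below only states that contract (the class and method names are made up, and nothing here is a real algorithm); the hard part, which those algorithms actually solve, is satisfying it when nodes crash and messages are delayed or lost:

```python
# Abstract consensus contract (a sketch, not an implementation).
# Algorithms such as Paxos or Raft must guarantee, informally:
#   - Agreement:   no two nodes decide different values.
#   - Validity:    the decided value was proposed by some node.
#   - Termination: every non-crashed node eventually decides.

class ConsensusNode:
    def propose(self, value: int) -> None:
        """Offer a value (e.g. one of the numbers 0..9)."""
        raise NotImplementedError  # Paxos/Raft machinery would go here

    def decided(self) -> int | None:
        """Return the agreed value once the protocol has converged."""
        raise NotImplementedError

def agreement_holds(nodes: list[ConsensusNode]) -> bool:
    """Check the agreement property: all decided nodes picked the same number."""
    decided_values = {n.decided() for n in nodes if n.decided() is not None}
    return len(decided_values) <= 1
```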

Problem 3: I have a distributed system which processes events on each node. Some of those events will be concurrent with each other. How do I detect them? I'll use a vector clock for that.
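
Here is a minimal vector clock sketch (assuming every node knows its own id) showing how two events end up either ordered or concurrent:

```python
# Minimal vector clock sketch: one counter per node id.
def vc_increment(clock: dict[str, int], node: str) -> dict[str, int]:
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new

def vc_merge(a: dict[str, int], b: dict[str, int]) -> dict[str, int]:
    """Taken on message receipt: element-wise maximum of both clocks."""
    return {k: max(a.get(k, 0), b.get(k, 0)) for k in a.keys() | b.keys()}

def happened_before(a: dict[str, int], b: dict[str, int]) -> bool:
    keys = a.keys() | b.keys()
    return all(a.get(k, 0) <= b.get(k, 0) for k in keys) and a != b

def concurrent(a: dict[str, int], b: dict[str, int]) -> bool:
    return not happened_before(a, b) and not happened_before(b, a)

e1 = vc_increment({}, "A")                 # event on node A -> {"A": 1}
e2 = vc_increment(vc_merge(e1, {}), "B")   # B saw e1, then did its own event
e3 = vc_increment({}, "C")                 # independent event on node C
print(happened_before(e1, e2))  # True
print(concurrent(e2, e3))       # True: neither saw the other
```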

Problem 4: I have several replicas of some data. These replicas may process some events locally and also synchronize with each other. When I synchronize, how do I know which replica is more recent? And how do I detect whether replicas have conflicting data? I'll use a version vector for this.
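
A version vector looks almost the same as a vector clock, but the comparison is used during sync to decide whether one replica's value supersedes the other's or whether the two are conflicting siblings. A small sketch under those assumptions (replica names are made up):

```python
# Version vector sketch: one counter per replica, compared during sync.
def dominates(a: dict[str, int], b: dict[str, int]) -> bool:
    """True if a has seen everything b has seen (a is at least as recent)."""
    return all(a.get(r, 0) >= v for r, v in b.items())

def compare(a: dict[str, int], b: dict[str, int]) -> str:
    if dominates(a, b) and dominates(b, a):
        return "equal"
    if dominates(a, b):
        return "a supersedes b"     # safe to overwrite b's value with a's
    if dominates(b, a):
        return "b supersedes a"
    return "conflict"               # keep both values as siblings

# Replicas r1 and r2 each accepted one local write since the last sync:
print(compare({"r1": 2, "r2": 1}, {"r1": 1, "r2": 2}))  # conflict
print(compare({"r1": 2, "r2": 2}, {"r1": 1, "r2": 2}))  # a supersedes b
```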

AndrewR
  • Thanks a lot for your answer. My confusion is: if there is a quorum, do we still need to do something for consensus (like a Raft or Paxos algorithm)? Because if there is a quorum, that itself is consensus in one sense... – Kumar Aug 17 '22 at 06:48
  • Hi, I think I failed to deliver the main idea - it all depends on the problem you have at hand. E.g. "if there is a quorum, is consensus required" - I don't know - what is the problem you are solving? For instance, quorums are used both in Dynamo-style replication and in Raft (in both cases the quorum is a majority) – AndrewR Aug 17 '22 at 15:01
  • Thanks! I am not solving any problem, just trying to get the hang of things in the distributed systems world! From a high level, quorum seems to solve most of the problems, so I was wondering where it falls short such that we need more complex consensus algorithms (like Raft, Paxos, etc.). Further, I was wondering whether for any use case they need to be used together... Apologies if I am too vague; I am still trying to understand these things :-( – Kumar Aug 17 '22 at 16:29
  • Not a problem, we all start somewhere. I had a "breakthrough" in distributed systems after I collected a list of the various problems which arise as soon as data gets distributed (something like "consistent prefix reads") and was then able to research how to solve them. After some time, quantity of knowledge transformed into quality. The most challenging part of distributed systems is all those non-obvious issues. – AndrewR Aug 17 '22 at 19:24

Martin Kleppmann has written an excellent book called Designing Data-Intensive Applications.

In this book Martin has described all of these concepts in great detail.

Let me quote some excerpts from the related discussions here:

Version Vectors, Vector Clocks

The example in Figure 5-13 used only a single replica. How does the algorithm change when there are multiple replicas, but no leader?

Figure 5-13 uses a single version number to capture dependencies between operations, but that is not sufficient when there are multiple replicas accepting writes concurrently. Instead, we need to use a version number per replica as well as per key. Each replica increments its own version number when processing a write, and also keeps track of the version numbers it has seen from each of the other replicas. This information indicates which values to overwrite and which values to keep as siblings.

The collection of version numbers from all the replicas is called a version vector [56]. A few variants of this idea are in use, but the most interesting is probably the dotted version vector [57], which is used in Riak 2.0 [58, 59]. We won’t go into the details, but the way it works is quite similar to what we saw in our cart example.

Like the version numbers in Figure 5-13, version vectors are sent from the database replicas to clients when values are read, and need to be sent back to the database when a value is subsequently written. (Riak encodes the version vector as a string that it calls causal context.) The version vector allows the database to distinguish between overwrites and concurrent writes.

Also, like in the single-replica example, the application may need to merge siblings. The version vector structure ensures that it is safe to read from one replica and subsequently write back to another replica. Doing so may result in siblings being created, but no data is lost as long as siblings are merged correctly.
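
To make that read/merge/write-back cycle concrete, here is a small Python sketch. The replica class and its get/put methods are made up for illustration (this is not Riak's actual client API); the point is only that the client reads siblings plus a version vector, merges, and hands the vector back on write:

```python
# Hedged sketch of the client cycle: a read returns sibling values plus a
# version vector ("causal context"); the client merges them and writes back.

class FakeReplica:
    """Toy in-memory stand-in for one database replica (not a real client API)."""
    def __init__(self, name: str):
        self.name = name
        self.data: dict[str, tuple[list[set], dict[str, int]]] = {}

    def get(self, key: str) -> tuple[list[set], dict[str, int]]:
        return self.data.get(key, ([set()], {}))

    def put(self, key: str, value: set, context: dict[str, int]) -> None:
        vv = dict(context)
        vv[self.name] = vv.get(self.name, 0) + 1   # replica bumps its own entry
        self.data[key] = ([value], vv)

def add_to_cart(replica: FakeReplica, key: str, item: str) -> None:
    siblings, context = replica.get(key)           # 1. read: siblings + context
    merged = set().union(*siblings) | {item}       # 2. merge siblings, then modify
    replica.put(key, merged, context)              # 3. write back with the context

r = FakeReplica("r1")
add_to_cart(r, "cart:alice", "milk")
add_to_cart(r, "cart:alice", "eggs")
print(r.get("cart:alice"))   # ([{'milk', 'eggs'}], {'r1': 2})  (set order may vary)
```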

Version vectors and vector clocks

A version vector is sometimes also called a vector clock, even though they are not quite the same. The difference is subtle — please see the references for details [57, 60, 61]. In brief, when comparing the state of replicas, version vectors are the right data structure to use.

Quorums for reading and writing

In the example of Figure 5-10, we considered the write to be successful even though it was only processed on two out of three replicas. What if only one out of three replicas accepted the write? How far can we push this?

If we know that every successful write is guaranteed to be present on at least two out of three replicas, that means at most one replica can be stale. Thus, if we read from at least two replicas, we can be sure that at least one of the two is up to date. If the third replica is down or slow to respond, reads can nevertheless continue returning an up-to-date value.

More generally, if there are n replicas, every write must be confirmed by w nodes to be considered successful, and we must query at least r nodes for each read. (In our example, n = 3, w = 2, r = 2.) As long as w + r > n, we expect to get an up-to-date value when reading, because at least one of the r nodes we’re reading from must be up to date. Reads and writes that obey these r and w values are called quorum reads and writes [44]. You can think of r and w as the minimum number of votes required for the read or write to be valid.

In Dynamo-style databases, the parameters n, w, and r are typically configurable. A common choice is to make n an odd number (typically 3 or 5) and to set w = r = (n + 1) / 2 (rounded up). However, you can vary the numbers as you see fit. For example, a workload with few writes and many reads may benefit from setting w = n and r = 1. This makes reads faster, but has the disadvantage that just one failed node causes all database writes to fail.
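
A quick sketch of that common parameter choice and the quorum condition, in plain Python (no particular database assumed):

```python
import math

def default_quorum(n: int) -> tuple[int, int]:
    """Common Dynamo-style choice: w = r = ceil((n + 1) / 2)."""
    w = r = math.ceil((n + 1) / 2)
    return w, r

def is_quorum_config(n: int, w: int, r: int) -> bool:
    """Reads overlap writes in at least one node exactly when w + r > n."""
    return w + r > n

print(default_quorum(3))            # (2, 2)
print(default_quorum(5))            # (3, 3)
print(is_quorum_config(3, 2, 2))    # True
print(is_quorum_config(3, 3, 1))    # True: w = n, r = 1 (read-heavy tuning)
```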

There may be more than n nodes in the cluster, but any given value is stored only on n nodes. This allows the dataset to be partitioned, supporting datasets that are larger than you can fit on one node. We will return to partitioning in Chapter 6.

The quorum condition, w + r > n, allows the system to tolerate unavailable nodes as follows:

  • If w < n, we can still process writes if a node is unavailable.
  • If r < n, we can still process reads if a node is unavailable.
  • With n = 3, w = 2, r = 2 we can tolerate one unavailable node.
  • With n = 5, w = 3, r = 3 we can tolerate two unavailable nodes. This case is illustrated in Figure 5-11.
  • Normally, reads and writes are always sent to all n replicas in parallel. The parameters w and r determine how many nodes we wait for—i.e., how many of the n nodes need to report success before we consider the read or write to be successful.

Figure 5-11. If w + r > n, at least one of the r replicas you read from must have seen the most recent successful write.

If fewer than the required w or r nodes are available, writes or reads return an error. A node could be unavailable for many reasons: because the node is down (crashed, powered down), due to an error executing the operation (can’t write because the disk is full), due to a network interruption between the client and the node, or for any number of other reasons. We only care whether the node returned a successful response and don’t need to distinguish between different kinds of fault.
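
Putting these mechanics together, here is a hedged sketch of a coordinator that sends a write to all n replicas in parallel, considers it successful once w of them acknowledge, and otherwise reports an error. The replica objects and their write() method are stand-ins, not any real database client:

```python
# Sketch of a quorum write coordinator (hypothetical replica interface).
from concurrent.futures import ThreadPoolExecutor, as_completed

class FlakyReplica:
    """Toy replica: either stores the value or raises, simulating a failure."""
    def __init__(self, up: bool = True):
        self.up, self.store = up, {}

    def write(self, key: str, value: str) -> bool:
        if not self.up:
            raise ConnectionError("replica unavailable")
        self.store[key] = value
        return True

def quorum_write(replicas: list, key: str, value: str, w: int) -> bool:
    """Send the write to all n replicas in parallel; succeed after w acks."""
    acks = 0
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(rep.write, key, value) for rep in replicas]
        for future in as_completed(futures):
            try:
                if future.result():
                    acks += 1
            except Exception:
                pass                 # crashed, unreachable, disk full, ...
            if acks >= w:
                return True          # quorum reached
    return False                     # fewer than w acks: report an error

replicas = [FlakyReplica(), FlakyReplica(), FlakyReplica(up=False)]
print(quorum_write(replicas, "x", "42", w=2))   # True: 2 of 3 acknowledged
```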

Distributed Transactions and Consensus

Consensus is one of the most important and fundamental problems in distributed computing. On the surface, it seems simple: informally, the goal is simply to get several nodes to agree on something. You might think that this shouldn’t be too hard. Unfortunately, many broken systems have been built in the mistaken belief that this problem is easy to solve.

Although consensus is very important, the section about it appears late in this book because the topic is quite subtle, and appreciating the subtleties requires some prerequisite knowledge. Even in the academic research community, the understanding of consensus only gradually crystallized over the course of decades, with many misunderstandings along the way. Now that we have discussed replication (Chapter 5), transactions (Chapter 7), system models (Chapter 8), linearizability, and total order broadcast (this chapter), we are finally ready to tackle the consensus problem.

There are a number of situations in which it is important for nodes to agree. For example:

Leader election

In a database with single-leader replication, all nodes need to agree on which node is the leader. The leadership position might become contested if some nodes can’t communicate with others due to a network fault. In this case, consensus is important to avoid a bad failover, resulting in a split brain situation in which two nodes both believe themselves to be the leader (see “Handling Node Outages” on page 156). If there were two leaders, they would both accept writes and their data would diverge, leading to inconsistency and data loss.

Atomic commit

In a database that supports transactions spanning several nodes or partitions, we have the problem that a transaction may fail on some nodes but succeed on others. If we want to maintain transaction atomicity (in the sense of ACID; see “Atomicity” on page 223), we have to get all nodes to agree on the outcome of the transaction: either they all abort/roll back (if anything goes wrong) or they all commit (if nothing goes wrong). This instance of consensus is known as the atomic commit problem.


The Impossibility of Consensus

You may have heard about the FLP result [68]—named after the authors Fischer, Lynch, and Paterson—which proves that there is no algorithm that is always able to reach consensus if there is a risk that a node may crash. In a distributed system, we must assume that nodes may crash, so reliable consensus is impossible. Yet, here we are, discussing algorithms for achieving consensus. What is going on here?

The answer is that the FLP result is proved in the asynchronous system model (see “System Model and Reality” on page 306), a very restrictive model that assumes a deterministic algorithm that cannot use any clocks or timeouts. If the algorithm is allowed to use timeouts, or some other way of identifying suspected crashed nodes (even if the suspicion is sometimes wrong), then consensus becomes solvable [67]. Even just allowing the algorithm to use random numbers is sufficient to get around the impossibility result [69].

Thus, although the FLP result about the impossibility of consensus is of great theoretical importance, distributed systems can usually achieve consensus in practice.


In this section we will first examine the atomic commit problem in more detail. In particular, we will discuss the two-phase commit (2PC) algorithm, which is the most common way of solving atomic commit and which is implemented in various databases, messaging systems, and application servers. It turns out that 2PC is a kind of consensus algorithm—but not a very good one [70, 71].

By learning from 2PC we will then work our way toward better consensus algorithms, such as those used in ZooKeeper (Zab) and etcd (Raft).
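
As a preview of how 2PC drives every participant to the same outcome, here is a minimal sketch of the coordinator's decision rule (happy path only; it ignores the logging, timeouts, and coordinator-crash cases that make the real protocol, and its shortcomings, interesting). The participant interface is made up for illustration:

```python
# Minimal two-phase commit sketch (happy path only; no crash recovery).

class Participant:
    """Toy participant: votes in phase 1, applies the decision in phase 2."""
    def __init__(self, name: str, can_commit: bool = True):
        self.name, self.can_commit, self.state = name, can_commit, "pending"

    def prepare(self) -> bool:
        return self.can_commit        # phase 1: "yes" promises to commit if told to

    def commit(self) -> None:
        self.state = "committed"      # phase 2a

    def abort(self) -> None:
        self.state = "aborted"        # phase 2b

def two_phase_commit(participants: list[Participant]) -> str:
    # Phase 1: collect votes; a single "no" forces an abort for everyone.
    all_yes = all(p.prepare() for p in participants)
    # Phase 2: broadcast the same decision to every participant.
    for p in participants:
        p.commit() if all_yes else p.abort()
    return "committed" if all_yes else "aborted"

print(two_phase_commit([Participant("db1"), Participant("db2")]))                    # committed
print(two_phase_commit([Participant("db1"), Participant("db2", can_commit=False)]))  # aborted
```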

Further reading

Peter Csala