Perform reading from a paxos-based distributed cluster

Question

Could anyone explain how to read data from a distributed cluster?

I mean a distributed cluster whose consistency is guaranteed by the Paxos algorithm.

In a real-world application, how does a client read the contents it has written to the cluster?

For example, in a 5-server cluster, maybe only 3 of the servers have the newest data and the other 2 have old data due to network delay or something similar.

Does this mean the client needs to read from at least a majority of the nodes? With 5 servers, does that mean reading the data from at least 3 servers and checking which one has the newest version number?

If so, it seems quite slow, since you need to read 3 copies. How do real-world systems implement this?

If the client reads from multiple nodes, it has to deal with the fact that messages may get lost, duplicated, delayed, or reordered. Imagine the cluster is just replicating a key-value store (map): you ask three nodes `getKey(1)` and get three responses at three different times saying `null`, `10`, `4`, due to replication delays between the nodes and message delays between the client and the cluster nodes. So you *must* read from the leader in Paxos, and for the leader to know it is still the leader at the point it responds, it needs to exchange messages with a majority of the cluster. - simbo1905, Nov 05 '14 at 21:32

score 2 · Answer 1 · edited May 23 '17 at 12:30

Clients should read from the leader. If a node knows it is not the leader, it should redirect the client to the leader. If a node does not know who the leader is, it should throw an error, and the client should pick another node at random until it is told who the leader is or finds it. If the node thinks it is the leader, it is still dangerous to return a read from local state: it may have just lost connectivity to the rest of the cluster right when it suffers a massive stall (CPU load, I/O stall, VM overload, large GC, some background task, server maintenance job, ...), such that it actually loses the leadership while replying to the client and gives out a stale read. This can be avoided by running a round of (multi)Paxos for the read.
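The client-side lookup described above can be sketched roughly as follows. This is a minimal in-memory illustration, not any particular library's API: `Node`, `NotLeader`, `LeaderUnknown` and `client_read` are hypothetical names, and the leader's own safety check (the Paxos round) is only marked with a comment.

```python
import random


class NotLeader(Exception):
    """Raised by a node that knows it is not the leader; carries the redirect."""

    def __init__(self, leader_id):
        super().__init__(f"redirect to node {leader_id}")
        self.leader_id = leader_id


class LeaderUnknown(Exception):
    """Raised by a node that does not know who the leader is."""


class Node:
    """Hypothetical cluster node holding a local copy of the key-value map."""

    def __init__(self, node_id, known_leader=None, store=None):
        self.node_id = node_id
        self.known_leader = known_leader
        self.store = store or {}

    def read(self, key):
        if self.known_leader is None:
            raise LeaderUnknown()
        if self.known_leader != self.node_id:
            raise NotLeader(self.known_leader)
        # Assumption: a real leader would run a round of (multi)Paxos here
        # before answering, to rule out a stale read.
        return self.store.get(key)


def client_read(nodes, key, max_attempts=10, rng=random):
    """Follow redirects, or retry another random node, until a leader answers."""
    target = rng.choice(list(nodes))
    for _ in range(max_attempts):
        try:
            return nodes[target].read(key)
        except NotLeader as redirect:
            target = redirect.leader_id
        except LeaderUnknown:
            target = rng.choice([n for n in nodes if n != target])
    raise RuntimeError("no leader found")


nodes = {
    1: Node(1, known_leader=2),
    2: Node(2, known_leader=2, store={"k": 10}),
    3: Node(3),  # does not know the leader
}
value = client_read(nodes, "k")
```

Whichever node the client happens to try first, the redirects and retries end at node 2, which is the only node that answers the read.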

Lamport clocks and vector clocks say that you must pass messages to establish that operation A happens before operation B when they run on different machines. Without such messages, the two operations run concurrently. This provides the theoretical underpinning for why we cannot say a read from a leader is not stale without exchanging messages with a majority of the cluster. The message exchange establishes a "happened-before" relationship between the read and the next write (which may happen on a new leader due to a failover).
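To make the happened-before argument concrete, here is a small sketch of a Lamport clock. The class and the variable names are illustrative only. Note that Lamport timestamps only guarantee that if A happened before B then A's timestamp is smaller; a smaller timestamp on its own does not prove an ordering.

```python
class LamportClock:
    """Minimal Lamport logical clock (illustrative sketch)."""

    def __init__(self):
        self.time = 0

    def tick(self):
        """Local event: advance the clock by one."""
        self.time += 1
        return self.time

    def send(self):
        """Sending is an event; the returned timestamp travels with the message."""
        return self.tick()

    def receive(self, msg_time):
        """On receipt, jump past both the local time and the sender's time."""
        self.time = max(self.time, msg_time) + 1
        return self.time


leader = LamportClock()
follower = LamportClock()

# No messages exchanged: a local read on the leader and an event on the
# follower get unrelated timestamps, so nothing orders one before the other.
local_read = leader.tick()
other_event = follower.tick()

# With a request/acknowledgement exchange, the follower's acknowledgement
# is ordered before whatever the leader does after receiving it.
request = leader.send()
follower.receive(request)
ack = follower.send()
after_ack = leader.receive(ack)
```

In this run `local_read` and `other_event` both carry timestamp 1 and are concurrent, while `after_ack` is strictly greater than `ack`, reflecting the ordering created by the message exchange.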

The leader itself can be an acceptor, so in a three-node cluster it needs just one response from one other node to complete a round of (multi)Paxos. It can send messages in parallel and reply to the client when it gets the first response. The network between nodes should be dedicated to intra-cluster traffic (and the best you can get) so that this does not add much latency for the client.
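The message counting described above might look like the sketch below. It is a simplification under stated assumptions, not a Paxos implementation: each peer is modelled as a hypothetical callable that takes the leader's `ballot` number and returns `True` if it still accepts that leader, and `confirm_leadership` and `leader_read` are made-up names. Only the majority arithmetic and the "reply as soon as enough responses arrive" behaviour are shown.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def confirm_leadership(peers, ballot, cluster_size):
    """Ask all peers in parallel; return True once a majority has acknowledged."""
    majority = cluster_size // 2 + 1
    acks = 1  # the leader is itself an acceptor and counts its own vote
    if acks >= majority:
        return True
    pool = ThreadPoolExecutor(max_workers=max(1, len(peers)))
    try:
        futures = [pool.submit(peer, ballot) for peer in peers]
        for future in as_completed(futures):
            try:
                if future.result():
                    acks += 1
            except Exception:
                continue  # an unreachable or failing peer is simply no ack
            if acks >= majority:
                return True  # do not wait for the slower peers
        return False
    finally:
        pool.shutdown(wait=False)


def leader_read(store, key, peers, ballot, cluster_size):
    """Serve a read from local state only after a majority confirms the leader."""
    if not confirm_leadership(peers, ballot, cluster_size):
        raise RuntimeError("lost leadership; client should retry elsewhere")
    return store.get(key)


def unreachable(ballot):
    raise ConnectionError("peer down")


# Three-node cluster: the leader plus one acknowledging peer is a majority.
result = leader_read({"k": 10}, "k", [lambda b: True, unreachable], 7, 3)
```

In a five-node cluster the same function needs two acknowledgements from the four peers before the read is answered.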

There is an answer that describes how Paxos can be used for a locking service which cannot tolerate stale reads or reordered writes, where a crash scenario is discussed, over at "some questions about paxos". Clearly a locking service cannot have reads and writes to the locks "running concurrently", which is why it does a round of (multi)Paxos for each client message to strictly order reads and writes across the cluster.
