I was trying to read the Paxos Commit paper and am struggling with moving past the introduction. The initial section motivates a fault-tolerant transaction coordinator for the two-phase commit protocol by describing regular two-phase commit as "blocking" when the transaction coordinator fails:

The failure of that coordinator can cause the protocol to block, with no process knowing the outcome, until the coordinator is repaired.

My question is this: if the coordinator fails, and assuming the coordinator's state is a deterministic function of the responses of the resource-managers (the individual databases), why can't we simply have one of the other resource-managers query every other resource-manager for its response and "repair" progress, essentially taking up the role of the coordinator after a timeout period?


This assumes the individual resource-managers are modelled as fault-tolerant black boxes (e.g. each one runs its own Multi-Paxos implementation on a cluster of n machines).

Curious

2 Answers

What you propose is indeed what many people have done with 2PC. The very same paper you referenced explains in section 3 why that strategy is not correct. In Lamport's words:

In particular, if the TM fails right after every RM has sent a Prepared message, then the other RMs have no way of knowing whether the TM committed or aborted the transaction

In my words: imagine the original coordinator is not dead, but just stuck for a long time (GC, deadlock, whatever). After the timeout, another node picks up the slack. Now the original coordinator could wake up and choose to commit, while the new coordinator could choose to abort. Depending on the interleaving of messages, some RMs would end up in the committed state and others in the aborted state, which is a system failure.
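To make that interleaving concrete, here is a minimal sketch in Python. It is a toy model, not code from the paper: the `ResourceManager` class, the `run_interleaving` helper and the rule that an RM applies the first decision it receives are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ResourceManager:
    """Hypothetical RM that has already voted Prepared."""
    name: str
    state: str = "prepared"

    def deliver(self, decision: str) -> None:
        # Assumption: a naive RM applies the first decision it receives
        # and ignores any later, conflicting one.
        if self.state == "prepared":
            self.state = decision


def run_interleaving(deliveries):
    """Apply (rm_name, decision) messages in order; return each RM's final state."""
    rms = {name: ResourceManager(name) for name in ("rm1", "rm2")}
    for rm_name, decision in deliveries:
        rms[rm_name].deliver(decision)
    return {name: rm.state for name, rm in rms.items()}


# The stalled original coordinator wakes up and sends "committed";
# the takeover coordinator, which timed out, sends "aborted".
interleaving = [
    ("rm1", "committed"),  # old coordinator reaches rm1 first
    ("rm2", "aborted"),    # new coordinator reaches rm2 first
    ("rm1", "aborted"),    # arrives too late, ignored
    ("rm2", "committed"),  # arrives too late, ignored
]
final_states = run_interleaving(interleaving)
print(final_states)  # {'rm1': 'committed', 'rm2': 'aborted'}
```

Both coordinators acted on what they knew, yet the two RMs disagree on the outcome of the same transaction.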

Misguided
  • So one thing I just realized is that the RMs themselves are assumed to be faulty resources and that their responses can change (or simply error out without causing an error on the transaction itself)! Is it fair to say that if this were not the case and the RMs were running their own multi-paxos implementation, then we could have an easier way out because their responses would be _guaranteed_? In which case, we can follow the third phase without non-determinism (as in the case you explained)? – Curious Jul 29 '20 at 17:48
  • Sorry, but I don't understand what you are proposing – Misguided Jul 29 '20 at 21:20
  • So the case you mentioned - a TM can either treat an RM's response as having sent a prepared message or not. Then future TMs would have no way of knowing what decision was taken by the earlier TM, since the RMs might be partitioned away from the new TM itself. However, if we treat each RM as a fault-tolerant black box, then we know that any new TM will be able to get a response from an RM, and the response will tell us what its state is with respect to the transaction. Then a new TM can simply poll all RMs and make progress on the leftover transaction. Does that explain things? – Curious Jul 30 '20 at 00:19
  • It is unrealistic to assume that RMs can't fail; for example, one could be out of disk space, etc. Even if you did assume that RMs can't fail, the issue depends on the fact that the TM can ask RMs to abort or commit, and it is unclear to me that the TM would never have any reason to abort a transaction if the RMs couldn't fail. For example, what if the network topology had changed to contain less than a majority just before the TM "hung"? Any ongoing transactions could have to be aborted, but if the new TM did have access to all RMs, it could choose to commit. – Misguided Jul 30 '20 at 16:07

I think "the coordinator's state is a deterministic function of the responses of the resource-managers" is not correct. It depends not only on the content of the responses (yes or no), but also on whether those responses all reach the coordinator in time. Assume all resource-managers respond with yes: if these responses arrive at the coordinator in time, the coordinator will commit. If some arrive late due to network latency or a partition, the coordinator might abort. Therefore the resource-managers can't know, merely by talking with each other, whether the old coordinator has committed or aborted, or whether the new coordinator will commit or abort.
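A small Python sketch of that point, under assumed names: `coordinator_decision`, the `timeout` parameter and the arrival times are all hypothetical and only illustrate that the same set of votes can lead to different decisions.

```python
def coordinator_decision(votes, arrival_times, timeout):
    """Commit only if every RM voted "yes" and every vote arrived within the timeout."""
    for rm, vote in votes.items():
        if arrival_times[rm] > timeout or vote != "yes":
            return "abort"
    return "commit"


# Identical votes in both runs; only the (assumed) network delay differs.
votes = {"rm1": "yes", "rm2": "yes"}
on_time = coordinator_decision(votes, {"rm1": 0.2, "rm2": 0.4}, timeout=1.0)
delayed = coordinator_decision(votes, {"rm1": 0.2, "rm2": 3.0}, timeout=1.0)
print(on_time, delayed)  # commit abort
```

The votes alone do not determine the outcome, so an RM that only sees the other RMs' votes cannot reconstruct what the coordinator decided.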