What happens if log replication responses arrive out of order in etcd raft?

Question

I'm new to etcd and am confused about a few points regarding log replication:

For example, the leader sends {term:2,index:3} and then {term:2,index:4}, and the majority respond in order too. But due to network delay, the leader receives the responses out of order, getting the response for {term:2,index:4} first.

How does etcd handle such a case? It seems to just ignore the log entry {term:2,index:3} and commit {term:2,index:4} directly.

func (pr *Progress) MaybeUpdate(n uint64) bool {
    var updated bool
    if pr.Match < n {
        pr.Match = n
        updated = true
        pr.ProbeAcked()
    }
    pr.Next = max(pr.Next, n+1)
    return updated
}

How does etcd retry when a response packet (e.g. the response for {term:2,index:3}) is lost? I can't find any code that handles this in the etcd project.

wpedrak · Accepted Answer · 2021-04-07T09:05:30.317

The questions you asked are more about raft than etcd (etcd implements raft, so they are still relevant, though). To get a high-level understanding of the raft algorithm, I highly recommend checking out the raft webpage and the raft paper (it's really nicely written!). I believe section 5.3, "Log replication", would be helpful.

First, let's lay some foundation: the leader keeps track of the matching entries for every follower. It keeps this information in nextIndex[] and matchIndex[] in the paper (check Fig. 2) and in ProgressMap in etcd.

// ProgressMap is a map of *Progress.
type ProgressMap map[uint64]*Progress

type Progress struct {
    Match, Next uint64
    ...
}

Now let's jump to your questions.

For example, the leader sends {term:2,index:3} and then {term:2,index:4}, and the majority respond in order too. But due to network delay, the leader receives the responses out of order, getting the response for {term:2,index:4} first. How does etcd handle such a case? It seems to just ignore the log entry {term:2,index:3} and commit {term:2,index:4} directly.

Here it all depends on the state of the follower (from the leader's perspective). Let's dive into StateProbe and StateReplicate.

In StateProbe, the leader tries to figure out which entries it should send to the follower. It sends one message at a time and waits for the response (which might be a rejection, in which case the leader has to decrease the Next value for this follower and retry). In this state, having two different MsgApp messages in flight to the same follower is not possible.

In StateReplicate, the leader assumes that the network is stable and sends (potentially) many MsgApp messages without waiting for acks. Let's work through an example.
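To make the probing loop concrete, here is a minimal sketch in Go. It is not etcd's code: the `entry` type and the `matches` and `probe` helpers are hypothetical, and it assumes the simplest strategy of lowering Next by one per rejection (etcd can use a hint from the follower to skip further back).

```go
package main

import "fmt"

// entry is a simplified log entry; only the term matters for matching.
// (Hypothetical type, not etcd's raftpb.Entry.)
type entry struct{ term uint64 }

// matches reports whether log (1-indexed; index 0 is the empty prefix)
// holds an entry with the given term at the given index.
func matches(log []entry, index, term uint64) bool {
	if index == 0 {
		return true
	}
	return index <= uint64(len(log)) && log[index-1].term == term
}

// probe simulates a leader probing one follower. Only one message is in
// flight at a time; every rejection lowers next by one until the entry
// preceding next matches on both sides. It returns the final next and
// the number of messages sent.
func probe(leader, follower []entry, next uint64) (uint64, int) {
	sent := 0
	for {
		sent++
		prev := next - 1
		var prevTerm uint64
		if prev > 0 {
			prevTerm = leader[prev-1].term
		}
		if matches(follower, prev, prevTerm) {
			return next, sent // accepted: replication can start from next
		}
		next-- // rejected: step back and retry
	}
}

func main() {
	leader := []entry{{1}, {1}, {2}, {2}}
	follower := []entry{{1}, {1}, {1}} // diverges from the leader at index 3
	next, sent := probe(leader, follower, uint64(len(leader))+1)
	fmt.Println("next:", next, "messages sent:", sent)
}
```

The loop always terminates because the empty prefix (index 0) matches on every log.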

Match := 2, Next := 3

Follower log : [E1, E2] (E stands for "entry")

Leader log: [E1, E2]

In this state, the leader gets put requests for entries E3, E4 and E5. Let's assume that the max batch size is 2, so the new entries can't all be sent in a single message. The leader will send 2 messages: (Index: 2, Entries: [E3, E4]) and (Index: 4, Entries: [E5]), where Index is the index of the entry immediately preceding Entries. The second message will be sent before the ack for the first one is received. In the case shown in your picture, the follower gets the first message, checks whether it can append it by using Index from the request (the check is performed in (raft).handleAppendEntries > (raftLog).maybeAppend > (raftLog).matchTerm > (raftLog).term), appends the entries to its log and sends an ack. Later on, the follower gets the 2nd request and does the same for it (checks whether it can append it and sends an ack).

The fact that the follower checks whether it can append entries before sending an ack is important here. Once the leader gets an ack for any message, it is sure that all entries up to Index + len(Entries) are present in the follower's log (otherwise the message would have been rejected instead of acked). Thanks to that, it does not matter if the first ack is delayed (or even lost).
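The reasoning above can be sketched as a small Go simulation. This is a simplification under stated assumptions, not etcd's implementation: `appendMsg`, `followerLog` and `progress` are hypothetical types, terms are omitted, and the consistency check is reduced to "the follower must already hold everything up to index".

```go
package main

import "fmt"

// appendMsg is a simplified MsgApp: index is the position of the entry
// that precedes entries in the leader's log. (Hypothetical type.)
type appendMsg struct {
	index   uint64
	entries []string
}

// followerLog is a simplified follower log; terms are omitted for brevity.
type followerLog struct{ log []string }

// handleAppend accepts the message only if the follower already holds
// every entry up to m.index. On success it returns the index of the last
// entry covered by the message, which is what the ack carries.
func (f *followerLog) handleAppend(m appendMsg) (uint64, bool) {
	if m.index > uint64(len(f.log)) {
		return 0, false // gap in the log: reject instead of ack
	}
	for i, e := range m.entries {
		if m.index+uint64(i)+1 > uint64(len(f.log)) {
			f.log = append(f.log, e) // append only entries not yet present
		}
	}
	return m.index + uint64(len(m.entries)), true
}

// progress mirrors the Match bookkeeping shown in the question.
type progress struct{ match uint64 }

// maybeUpdate raises match if n is newer; stale acks return false.
func (p *progress) maybeUpdate(n uint64) bool {
	if p.match < n {
		p.match = n
		return true
	}
	return false
}

func main() {
	f := &followerLog{log: []string{"E1", "E2"}}
	ack1, _ := f.handleAppend(appendMsg{index: 2, entries: []string{"E3", "E4"}})
	ack2, _ := f.handleAppend(appendMsg{index: 4, entries: []string{"E5"}})

	// The leader receives the acks in reverse order.
	pr := &progress{match: 2}
	fmt.Println("2nd ack updates:", pr.maybeUpdate(ack2), "match:", pr.match)
	fmt.Println("1st ack updates:", pr.maybeUpdate(ack1), "match:", pr.match)

	// A follower that never saw the first message rejects the second one.
	g := &followerLog{log: []string{"E1", "E2"}}
	_, ok := g.handleAppend(appendMsg{index: 4, entries: []string{"E5"}})
	fmt.Println("2nd message accepted without the 1st:", ok)
}
```

The last case is the key point: an ack for the second message can only exist if the first message was already applied, so a late or lost first ack carries no extra information.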

How does etcd retry when a response packet (e.g. the response for {term:2,index:3}) is lost? I can't find any code that handles this in the etcd project.

I'll focus on etcd now, as the raft paper only says "the leader retries AppendEntries RPCs indefinitely", which is not very constructive. At a short, regular interval, the leader sends MsgHeartbeat to the follower, and the latter responds with MsgHeartbeatResp. As part of handling MsgHeartbeatResp, the leader does the following:

if pr.Match < r.raftLog.lastIndex() {
    r.sendAppend(m.From)
}

This should be read as: "If there is any entry that is not known to be present on the follower, send an append starting from the first missing entry". It acts as a retry mechanism, since pr.Match will not increase without an ack from the follower.
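Here is a minimal Go sketch of that retry behaviour when an ack is lost. It is a toy model, not etcd's code: the `leaderState` type and the `deliver` callback (which stands in for sending an append and possibly receiving its ack) are assumptions made for illustration.

```go
package main

import "fmt"

// leaderState is a hypothetical, stripped-down view of the leader's
// bookkeeping for a single follower.
type leaderState struct {
	lastIndex uint64 // index of the last entry in the leader's log
	match     uint64 // highest index known to be replicated on the follower
	resent    int    // number of appends triggered by heartbeat responses
}

// onHeartbeatResp mirrors the check above: if the follower is not known
// to be caught up, send an append again. deliver returns the acked index
// and whether the ack actually reached the leader.
func (l *leaderState) onHeartbeatResp(deliver func() (uint64, bool)) {
	if l.match < l.lastIndex {
		l.resent++
		if ack, ok := deliver(); ok && l.match < ack {
			l.match = ack
		}
	}
}

func main() {
	l := &leaderState{lastIndex: 4, match: 2}

	// The follower already holds entries up to 4, but the first ack is lost.
	calls := 0
	lossy := func() (uint64, bool) {
		calls++
		return 4, calls > 1
	}

	for beat := 1; beat <= 3; beat++ {
		l.onHeartbeatResp(lossy)
		fmt.Println("heartbeat", beat, "match:", l.match, "resent:", l.resent)
	}
}
```

Once match catches up with lastIndex, later heartbeat responses no longer trigger an append, so the retry stops on its own.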

Thanks a lot, the second question is clear to me now. For the first question, how does the leader guarantee that it does NOT send a new entry until the previous one is acked? In the stepLeader function, when MsgProp comes in, the leader appends the entry (also updating its Next) and then sends it to peers, but it doesn't check at all whether the previous entry was acknowledged. — TAKCHI CHAN, Mar 30 '21 at 10:29
You are right, indeed. What I described above is (only slightly) simplified. In fact, `MsgApp` holds many `Entry` items and not a single one. It means that if we don't have an ack for the old msg, we will send those entries again (together with the new one). Imagine that the leader sent entries [E1, E2, E3] to the follower. Then it got a proposal to add E4. As [E1, E2, E3] was not acked (thus `Next` for this peer was not changed), it will send [E1, E2, E3, E4] to the follower. Mind that getting an ack for either of those requests is valid. It will result in `Next = Next + 3` or `Next = Next + 4` depending on the request. — wpedrak, Mar 30 '21 at 11:28
Does the example you described above only happen in `StateProbe`? If the follower is in `StateReplicate`, the leader sends entries [E1, E2, E3] to the follower and sets Next = Next + 3. Then it gets a proposal to add E4, sends entries [E4] and sets Next = Next + 1. This is because every time the leader sends out entries, it calls the `OptimisticUpdate` function of Progress. — TAKCHI CHAN, Mar 31 '21 at 03:38
You are 100% right. Thanks for pointing that out (and in fact teaching me something :D). I've dug into the `StateReplicate` path and found out that in the case shown in your picture everything will be fine, as the follower acked both messages. The leader will get the 2nd ack first and update `Match` accordingly (as the ack for the 2nd request in fact means "the follower has all entries up to index 4"). The 1st ack, which arrives second, will simply be ignored, as `pr.MaybeUpdate(m.Index)` in the `MsgAppResp` handling will return `false`. — wpedrak, Mar 31 '21 at 08:18
Let's assume that the leader got the 2nd ack, but the 1st ack is lost. What really bothers me is that it seems to update the follower's Match to 4 directly, without retrying index 3. — TAKCHI CHAN, Mar 31 '21 at 12:08
Once the leader has got the 2nd ack, it is sure that the 1st ack was also sent. It might be late or lost, but the follower would not send the 2nd ack without sending the 1st. That is because, without the entries from the 1st request, the follower would reject the 2nd request and thus not send an ack for it (or rather send a reject message). — wpedrak, Apr 06 '21 at 08:26
Thank you, your answer is very helpful and you are a very patient person. These days I have been reading the code of `rafthttp`; it keeps the message order. — TAKCHI CHAN, Apr 07 '21 at 06:56
I'm not sure if I got it correctly, but IMHO `rafthttp` can't enforce the order in which the leader gets acks, as they might be delayed (or dropped) by the network, over which we don't have any control. Btw. I've updated my answer to incorporate all the comments; I hope it clarifies things a bit. — wpedrak, Apr 07 '21 at 09:09
