How does CouchDB Replication behave with Failed / Recovered servers?

Question

Consider the following scenario:

3 EC2 instances located in:

US-WEST
Ireland
Tokyo

Each instance is a dedicated CouchDB server. Each CouchDB server is set up to run continuous, bi-directional replication with every other server.

Now assume that the Ireland server goes offline due to some AWS outage. The US-WEST and Tokyo CouchDB servers will retry X number of times and then eventually fail replication with that server (is this correct?)

Let's say 6 hours go by and AWS gets the region back online and that server comes back up -- I assume US-WEST and Tokyo will ignore the server in Ireland until the Irish CouchDB server re-initiates the bi-directional sync with both of them, a la:

Irish CouchDB _replicator Pseudo-Settings

replicate[source=localhost,target=us-west]
replicate[source=us-west,target=localhost]
replicate[source=localhost,target=tokyo]
replicate[source=tokyo,target=localhost]

Q1: Is my understanding of Couch's replication failure/recovery correct?

Q2: What if there is a network failure that fixes itself an hour later (specifically, there is no server restart forcing the DB to re-init itself on startup)? How do the respective CouchDB instances react to this? I imagine that US-WEST and Tokyo will forget about Ireland, but will Ireland suddenly start talking to those two servers again, re-initializing the bi-directional, continuous replication?

I am specifically interested in failure recovery in the EC2 environment, so if there is a specific detail to that environment I have missed, please let me know.

Thanks!

score 4 · Accepted Answer · answered Aug 13 '11 at 19:57

Prior to CouchDB 1.1, a replication task is not persistent, even a continuous one. In the event of a disconnection, there is a limited number of retry attempts, but eventually the task stops. When connectivity resumes you will need to initiate replication again. Since replication is idempotent (starting the same replication task twice is the same as starting it once), you can just add a cron job to start it every minute (or whatever interval seems sane to you). If the task is already running, the attempt returns success but does not start another replication.
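As a rough illustration of the cron approach, here is a minimal Python sketch that a scheduled job could call to (re)start a continuous replication through the `_replicate` endpoint. The function names, the database name `mydb`, and the `example.com` host are assumptions made up for this example; adjust them (and add authentication) for a real deployment.

```python
import json
import urllib.request


def build_replicate_request(couch_url, source, target):
    """Build a POST to /_replicate asking for a continuous replication."""
    body = json.dumps(
        {"source": source, "target": target, "continuous": True}
    ).encode("utf-8")
    return urllib.request.Request(
        couch_url.rstrip("/") + "/_replicate",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ensure_replication(couch_url, source, target, timeout=30):
    """Send the request. Repeating it for a replication that is already
    running is harmless, which is what makes a cron-driven retry safe."""
    req = build_replicate_request(couch_url, source, target)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Hypothetical usage from a script run by cron on the Ireland server:
# ensure_replication("http://localhost:5984", "mydb",
#                    "http://us-west.example.com:5984/mydb")
```

Each direction of each pair needs its own call, so the four pseudo-settings in the question would map to four such requests.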

In 1.1, you can create persistent replication tasks by creating a document in the special _replicator database. CouchDB will retry the replication if it crashes or the connection is interrupted. NOTE: 1.1.0 eventually gives up; in the next release (1.1.1) we allow infinite retries.
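A minimal sketch of what those _replicator documents might look like for the Ireland server in the question, written in Python. The document IDs (`push-to-...`, `pull-from-...`), the database name `mydb`, and the peer URLs are hypothetical choices for illustration; only the `source`, `target`, and `continuous` fields are part of the replication document itself.

```python
import json
import urllib.request


def replicator_docs(local_db, peers):
    """Return one push and one pull document per peer, keyed by doc ID.

    `peers` maps a short name to the remote database URL.
    """
    docs = {}
    for name, remote_url in peers.items():
        docs[f"push-to-{name}"] = {
            "source": local_db, "target": remote_url, "continuous": True,
        }
        docs[f"pull-from-{name}"] = {
            "source": remote_url, "target": local_db, "continuous": True,
        }
    return docs


def save_replicator_doc(couch_url, doc_id, doc, timeout=30):
    """PUT the document into _replicator. If a document with this ID
    already exists, CouchDB answers 409 and urlopen raises HTTPError."""
    req = urllib.request.Request(
        f"{couch_url.rstrip('/')}/_replicator/{doc_id}",
        data=json.dumps(doc).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Hypothetical usage on the Ireland server:
# peers = {"us-west": "http://us-west.example.com:5984/mydb",
#          "tokyo": "http://tokyo.example.com:5984/mydb"}
# for doc_id, doc in replicator_docs("mydb", peers).items():
#     save_replicator_doc("http://localhost:5984", doc_id, doc)
```

Because the documents are stored in a database, they only need to be written once per server rather than re-submitted on a schedule.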

As CouchDB is designed from the ground up to support multi-master replication, you won't be surprised to hear that it handles interruptions to connectivity very well. The changes that occurred during the interruption are rapidly found and replicated.

Robert, that is just awesome. After using SimpleDB, Redis and Mongo I suddenly became aware of Couch's obsession with replication and it was like a light went off in my head... suddenly ALL my pain points with NoSQL data stores/persistence/safety fell away and I'm left with this straightforward, simple and robust system that *just works*. Thanks very much for the clarification! -- Riyad Kalla, Aug 14 '11 at 15:17
