Consider the following scenario:
Three EC2 instances located in:
- US-WEST
- Ireland
- Tokyo
Each instance is a dedicated CouchDB server. Each CouchDB server is set up to run continuous, bi-directional replication with every other server.
Now assume that the Ireland server goes offline due to an AWS outage. The US-WEST and Tokyo CouchDB servers will retry X times and then eventually give up on replicating with that server. (Is this correct?)
Let's say 6 hours go by, AWS gets the region back online, and the Ireland server comes back up. I assume US-WEST and Tokyo will ignore the Ireland server until it re-initiates the bi-directional sync with both of them, a la:
Irish CouchDB _replicator Pseudo-Settings
- replicate[source=localhost,target=us-west]
- replicate[source=us-west,target=localhost]
- replicate[source=localhost,target=tokyo]
- replicate[source=tokyo,target=localhost]
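
To make the pseudo-settings above concrete, here is a rough sketch of what the four documents in Ireland's _replicator database might look like. The host names, the database name "mydb", and the "_id" values are made up for illustration; only the "source", "target", and "continuous" fields are standard _replicator document fields. The script only builds and prints the documents; it does not talk to any server.

```python
import json

# Hypothetical peer URLs; replace with the real hosts and database name.
LOCAL_DB = "http://localhost:5984/mydb"
PEERS = {
    "us-west": "http://us-west.example.com:5984/mydb",
    "tokyo": "http://tokyo.example.com:5984/mydb",
}


def build_replicator_docs(local_url, peers):
    """Return one continuous _replicator document per direction for each peer."""
    docs = []
    for name, remote_url in peers.items():
        # Push: local -> remote
        docs.append({
            "_id": f"push-to-{name}",  # illustrative document ID
            "source": local_url,
            "target": remote_url,
            "continuous": True,
        })
        # Pull: remote -> local
        docs.append({
            "_id": f"pull-from-{name}",  # illustrative document ID
            "source": remote_url,
            "target": local_url,
            "continuous": True,
        })
    return docs


docs = build_replicator_docs(LOCAL_DB, PEERS)
for doc in docs:
    print(json.dumps(doc))
```

Each printed document would be saved into the _replicator database on the Ireland server, with equivalent sets on the other two servers.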
Q1: Is my understanding of Couch's replication failure/recovery correct?
Q2: What if there is a network failure that fixes itself an hour later? Specifically, no server is restarted, so nothing forces CouchDB to re-initialize its replications on startup. How do the respective CouchDB instances react to this? I imagine that US-WEST and Tokyo will forget about Ireland, but will Ireland suddenly start talking to those two servers again, re-initializing the bi-directional, continuous replication?
I am specifically interested in failure recovery in the EC2 environment, so if I have missed a detail specific to that environment, please let me know.
Thanks!