Here are the replies I got on the CouchDB mailing lists:
If we are talking about Couch 1.6, the retries_per_request attribute controls the number of attempts the current replication makes to read the _changes feed before giving up. The max_replication_retry_count attribute controls the number of times the replication manager retries the whole replication job. Setting this attribute to “infinity” should make the replication manager never give up.
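The two settings nest: per-request retries happen inside each run of the job, and job restarts wrap the whole run. Below is a minimal Python model of that nesting, not CouchDB code, under these assumptions: a request is a plain callable that raises a hypothetical `RequestFailed` on error, `max_replication_retry_count` counts restarts after the first run, `None` stands in for "infinity", and no waiting is modelled. `fetch_with_retries`, `run_job` and `flaky` are likewise hypothetical names.

```python
import itertools
from typing import Callable, Iterable, Optional, Tuple


class RequestFailed(Exception):
    """Hypothetical error raised when one replicator request fails."""


def fetch_with_retries(request: Callable[[], str], retries_per_request: int) -> str:
    """Inner loop: the first attempt plus up to `retries_per_request` retries."""
    retries_left = retries_per_request
    while True:
        try:
            return request()
        except RequestFailed:
            if retries_left == 0:
                raise  # this request gives up, which crashes the whole job
            retries_left -= 1


def run_job(request: Callable[[], str], retries_per_request: int,
            max_replication_retry_count: Optional[int]) -> Tuple[str, int]:
    """Outer loop: rerun the whole job; None plays the role of "infinity".

    Returns the result and the number of restarts that preceded it.
    """
    restarts: Iterable[int] = (
        itertools.count() if max_replication_retry_count is None
        else range(max_replication_retry_count + 1)
    )
    for restart in restarts:
        try:
            return fetch_with_retries(request, retries_per_request), restart
        except RequestFailed:
            continue  # the job crashed; the manager starts it again
    raise RequestFailed("replication job gave up")


def flaky(failures: int) -> Callable[[], str]:
    """Build a request that fails `failures` times in a row, then succeeds."""
    remaining = failures

    def request() -> str:
        nonlocal remaining
        if remaining > 0:
            remaining -= 1
            raise RequestFailed("source unreachable")
        return "ok"

    return request


print(run_job(flaky(25), retries_per_request=10, max_replication_retry_count=None))
# ('ok', 2)
```

With a source that fails 25 times in a row and `retries_per_request = 10`, each run absorbs 11 failures (the first attempt plus 10 retries), so an unlimited job succeeds on its third run, while `run_job(flaky(25), 10, 1)` gives up after two runs.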
I don’t think the interval between those attempts is configurable. As far as I understand, it starts at 2.5 seconds between retries and then doubles until it reaches 10 minutes, which is a hard upper limit.
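Taking those figures at face value (2.5 seconds, doubling, 10-minute ceiling), the wait schedule between job restarts can be sketched as an endless Python generator, which fits the "infinity" setting. The constant names and the clamping of the last step to exactly 600 seconds are assumptions of this sketch; the reply does not say how the final doubling is rounded.

```python
from itertools import islice
from typing import Iterator

INITIAL_WAIT_S = 2.5  # first wait between job restarts, per the reply
MAX_WAIT_S = 600.0    # 10-minute hard upper limit, per the reply


def restart_waits(initial: float = INITIAL_WAIT_S, cap: float = MAX_WAIT_S) -> Iterator[float]:
    """Yield successive waits: start at `initial`, double each time, never exceed `cap`.

    The generator never ends, mirroring max_replication_retry_count = infinity.
    """
    wait = initial
    while True:
        yield wait
        wait = min(wait * 2, cap)


print(list(islice(restart_waits(), 10)))
# [2.5, 5.0, 10.0, 20.0, 40.0, 80.0, 160.0, 320.0, 600.0, 600.0]
```

Under this reading, the first eight waits add up to 637.5 s (about 10.6 minutes), and every retry after that is spaced 10 minutes apart.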
Extended answer:
The answer is slightly different depending on whether you're using a 1.x/2.0 release or the current master.
If you're using a 1.x or 2.0 release, set "max_replication_retry_count = infinity" so that failed replications are always retried. That setting controls how the whole replication job restarts if there is any error. Then "retries_per_request" can be used to handle errors for individual replicator HTTP requests, basically the case where a quick, immediate retry succeeds. The default value for "retries_per_request" is 10. After the first failure there is a 0.25-second wait; on the next failure it doubles to 0.5 seconds, and so on. The maximum wait interval is 5 minutes.
But if you expect to be offline routinely, it may not be worth retrying individual requests for too long, so reduce "retries_per_request" to 6 or 7. Individual requests would then retry a few times over about 10 to 20 seconds, after which the whole replication job crashes and is retried.
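To check the "about 10 to 20 seconds" estimate, the per-request backoff can be added up. This Python sketch assumes that `retries_per_request = N` means N retries, each preceded by one sleep that starts at 0.25 s and doubles up to the 5-minute cap, and it ignores the time the requests themselves take; `request_waits` is a hypothetical helper.

```python
INITIAL_WAIT_S = 0.25  # sleep before the first retry, per the reply
MAX_WAIT_S = 300.0     # 5-minute ceiling on any single sleep, per the reply


def request_waits(retries_per_request: int) -> list[float]:
    """Sleep before each retry of one HTTP request, assuming one sleep per retry."""
    waits: list[float] = []
    wait = INITIAL_WAIT_S
    for _ in range(retries_per_request):
        waits.append(wait)
        wait = min(wait * 2, MAX_WAIT_S)  # double, but never beyond the ceiling
    return waits


for n in (6, 7, 10):
    print(n, sum(request_waits(n)))
# 6 15.75
# 7 31.75
# 10 255.75
```

Under that reading, 6 retries sleep for 15.75 s in total and 7 retries for 31.75 s, while the default of 10 adds up to 255.75 s (about 4.3 minutes) before the job crashes. If the setting instead counts the first attempt, drop the last sleep: 7.75 s for 6 and 15.75 s for 7.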
If you're using the current master, which has the new scheduling replicator, there is no need to set "max_replication_retry_count": that setting is gone, and all replication jobs keep retrying for as long as the replication document exists. "retries_per_request" works the same as above. The replication scheduler also applies exponential backoff when replication jobs fail consecutively. The first backoff is 30 seconds; then it doubles to 1 minute, 2 minutes, and so on. The maximum backoff wait is about 8 hours. But if you don't want to wait 4 hours on average (about half of the maximum wait, if connectivity comes back at a random point during it) for the replication to restart once network connectivity is restored, and want it to be about 5 minutes or so, set "max_history = 8" in the "replicator" config section. max_history controls how much history of past events is retained for each replication job. With less history of consecutive crashes, the backoff wait interval is also shorter.
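The scheduler's backoff can be sketched the same way. This Python sketch takes the 30-second base and the doubling from the reply; the exponent cap of 10 is an assumption, chosen because 30 s × 2^10 = 30,720 s (about 8.5 hours) matches "about 8 hours". No random jitter is modelled, and `backoff_seconds`, `capped_backoff` and `max_counted_crashes` are hypothetical names. How many consecutive crashes eight history entries can represent depends on what else the history records, so `max_counted_crashes` only stands in for the effect of `max_history` rather than mirroring it exactly.

```python
BASE_BACKOFF_S = 30  # first backoff, per the reply
MAX_EXPONENT = 10    # assumption: 30 s * 2**10 = 30720 s, roughly 8.5 hours


def backoff_seconds(consecutive_crashes: int) -> int:
    """Wait before restarting a job that has crashed this many times in a row."""
    if consecutive_crashes < 1:
        return 0  # a job that has not crashed is not backed off
    exponent = min(consecutive_crashes - 1, MAX_EXPONENT)
    return BASE_BACKOFF_S * 2 ** exponent


def capped_backoff(consecutive_crashes: int, max_counted_crashes: int) -> int:
    """Hypothetical stand-in for a short max_history: only the most recent
    `max_counted_crashes` crashes are visible, so the exponent stops growing."""
    return backoff_seconds(min(consecutive_crashes, max_counted_crashes))


print([backoff_seconds(n) for n in range(1, 13)])
# [30, 60, 120, 240, 480, 960, 1920, 3840, 7680, 15360, 30720, 30720]
print(capped_backoff(50, 4))
# 240
```

Uncapped, a job that keeps crashing settles at 30,720 s between restarts; if connectivity returns at a random moment within that wait, the expected delay is half of it, about 4.3 hours. If the retained history only let four consecutive crashes be counted, the wait would never exceed 240 seconds, in line with the "about 5 minutes" target.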
So to summarize, for 1.x/2.0 releases:

    [replicator]
    max_replication_retry_count = infinity
    retries_per_request = 6

For current master:

    [replicator]
    max_history = 8
    retries_per_request = 6