OpenLDAP syncrepl does not recover after network interruption

Question

We have discovered an issue where syncrepl does not recover after a network interruption.

Environment:

CentOS 7 replicas
OpenLDAP 2.4.44(-21.el7_6)
Loglevel set to comm sync (produces nothing useful)

Synchronization configuration:

syncrepl rid=113
  provider="ldap://ldap.example.com"
  type=refreshAndPersist
  retry="60 +"
  searchbase="dc=example,dc=com"
  sizelimit=unlimited
  bindmethod=simple
  starttls=critical
  tls_reqcert=allow
  binddn="cn=Replicator,dc=example,dc=com"
  credentials="supersecretpassword"
updateref ldap://ldap.example.com

One aspect of the LDAP service we monitor is the currency and consistency of all production replicas. A cron job updates a "canary record" on the primary with the current Unix epoch time in an easily queried attribute. Replicas are checked regularly for currency by comparing their copy of that epoch value against the canary record on the primary. If the difference is more than 10 minutes, we are notified.

A couple of our replicas are, as far as we can tell, occasionally subject to brief network disruptions. We have limited insight into this, but we suspect it is why these replicas occasionally fail the currency test. When this happens, the replica keeps failing the test (it never resumes syncing) until slapd is restarted. This is despite the setting retry="60 +", which as I understand it should retry synchronization every 60 seconds, forever.

This problem can be reproduced on any of our replicas by blocking traffic with firewall rules. I created two rules: one to drop all traffic to the primary, and one to drop all traffic from the primary. I then wait for some period (I am not sure how long is needed; I typically switch tasks until the first notification arrives) and remove the firewall blocks. Synchronization never recovers.

What I have found is that the packet counters on the blocking rules show 0 packets from the replica to the primary, and more than 0 from the primary to the replica. This matches the expected "push" behaviour of syncrepl in refreshAndPersist mode, but we expected the replica to retry the connection somehow.

The replica retains an outgoing connection to the primary.

I believe the use of keepalive or, if all else fails, refreshOnly will resolve the currency issue for us. Still, it seems surprising that an occasional network interruption like this can freeze synchronization in what I think is a pretty standard configuration (refreshAndPersist examples do not include or recommend the use of keepalive).
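For reference, the refreshOnly fallback would look roughly like the sketch below. It is the stanza above with the type changed and an interval added; the 5-minute interval (format dd:hh:mm:ss) is an assumption chosen to stay under our 10-minute alert threshold, and this variant has not been tested.

```
# Untested sketch: refreshOnly polls the provider on a fixed interval
# instead of holding a persistent search open.
# interval=00:00:05:00 (5 minutes) is an assumed value, not a recommendation.
syncrepl rid=113
  provider="ldap://ldap.example.com"
  type=refreshOnly
  interval=00:00:05:00
  retry="60 +"
  searchbase="dc=example,dc=com"
  sizelimit=unlimited
  bindmethod=simple
  starttls=critical
  tls_reqcert=allow
  binddn="cn=Replicator,dc=example,dc=com"
  credentials="supersecretpassword"
```

The comments sit above the directive rather than between its continuation lines, so they cannot interfere with how the indented lines are joined.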

Can somebody advise the best change to make here?

score 0 · Answer 1 · answered Sep 17 '20 at 17:07

I have rolled out keepalive="360:60:60" across our replicas and tested it. As far as I know at the time of writing, this is the best way to deal with this issue.
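As a sketch, the stanza from the question with the keepalive setting added looks like the following. Per slapd.conf(5), the three fields are idle:probes:interval, so 360:60:60 means TCP starts probing after the connection has been idle for 360 seconds and sends up to 60 probes, 60 seconds apart. Everything else is copied from the question unchanged.

```
# keepalive=<idle>:<probes>:<interval>
# Lets the replica's TCP stack detect a dead connection to the provider.
syncrepl rid=113
  provider="ldap://ldap.example.com"
  type=refreshAndPersist
  retry="60 +"
  keepalive="360:60:60"
  searchbase="dc=example,dc=com"
  sizelimit=unlimited
  bindmethod=simple
  starttls=critical
  tls_reqcert=allow
  binddn="cn=Replicator,dc=example,dc=com"
  credentials="supersecretpassword"
```

The man page also notes that only some systems support customizing these values; where they are not supported, the parameter is ignored and the system-wide TCP keepalive settings apply.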

Testing was performed the same way the problem was reproduced in the original question. I did notice that when testing against a test primary running on CentOS 7, the issue did not occur in the tested timeframe: the connection recovered whether or not keepalive was set. I have not yet investigated why. Regardless, our current production primary runs CentOS 6, and keepalive appears to solve the problem of network interruptions killing synchronization.

Not noted in the original question: we may be seeing this issue because our replicas are distributed across a wide area, crossing many different network boundaries and different types and ages of devices.
