We have discovered an issue where syncrepl does not recover after a network interruption.
Environment:
- CentOS 7 replicas
- OpenLDAP 2.4.44(-21.el7_6)
- Loglevel set to "comm sync" (produces nothing useful)
- Synchronization configuration:
    syncrepl rid=113
      provider="ldap://ldap.example.com"
      type=refreshAndPersist
      retry="60 +"
      searchbase="dc=example,dc=com"
      sizelimit=unlimited
      bindmethod=simple
      starttls=critical
      tls_reqcert=allow
      binddn="cn=Replicator,dc=example,dc=com"
      credentials="supersecretpassword"
    updateref ldap://ldap.example.com
One of the aspects of LDAP service we monitor is the currency and consistency of all production replicas. A cron job on the primary updates a "canary record" with the current Unix epoch time in an easily queried attribute. Replicas are checked regularly for currency by comparing their copy of that epoch value against the value on the primary. If the difference is more than 10 minutes, we are notified.
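To make the currency test concrete, here is a minimal sketch of the comparison in Python. The function names and the constant are hypothetical, chosen only for illustration; fetching the two epoch values (for example with an LDAP search against the primary and the replica) is left out, and our real check may differ in detail.

```python
# Hypothetical helper names; only the 10-minute threshold comes from our setup.
MAX_LAG_SECONDS = 10 * 60


def replica_lag(primary_epoch: int, replica_epoch: int) -> int:
    """Seconds by which the replica's canary value trails the primary's."""
    # Clamp at zero so a small clock or timing skew never reports negative lag.
    return max(0, primary_epoch - replica_epoch)


def is_stale(primary_epoch: int, replica_epoch: int,
             max_lag: int = MAX_LAG_SECONDS) -> bool:
    """True when the replica is more than max_lag seconds behind the primary."""
    return replica_lag(primary_epoch, replica_epoch) > max_lag
```

A replica stuck in the state described here keeps its old canary value while the primary's value advances, so the lag grows without bound until slapd is restarted.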
We have a couple of replicas which are, so far as we can tell, occasionally subject to brief network disruptions. We have limited insight into this issue, but at this point we suspect it is the cause of these replicas occasionally failing currency tests. When this happens, the replicas continue to fail the test (that is, they fail to sync) until slapd is restarted. This is despite the setting retry="60 +", which should, as I understand it, retry synchronization every 60 seconds, forever.
This problem can be reproduced on any of our replicas with firewall rules that block traffic between the replica and the primary. I created two rules: one to drop all traffic to the primary, and one to drop all traffic from the primary. Wait for some period (I'm not sure how long; I typically switch tasks until I get the first notification) and then remove the firewall rules. Synchronization never recovers.
What I have found is that the packet counters on the blocking rules show 0 packets from the replica to the primary, and >0 from the primary to the replica. This matches the expected "push" behaviour of syncrepl, but we expected the replica to retry the connection somehow.
The replica retains an outgoing connection to the primary.
I believe the use of keepalive, or, if all else fails, refreshOnly, will resolve the currency issue for us, but it seems surprising that an occasional network interruption like this could cause synchronization to freeze in what I think is a pretty standard configuration (refreshAndPersist examples do not include or recommend the use of keepalive).
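For reference, my understanding of why keepalive should help: the replica sits on an idle TCP connection waiting for pushed updates and never sends anything itself, so it has no way to notice the peer is gone. The Python sketch below only illustrates the underlying socket options; it is not slapd's code. As I read slapd.conf(5), the syncrepl keepalive parameter takes the same three values as idle:probes:interval. The helper names and the example values 240, 10 and 30 are my own assumptions, and the TCP_KEEP* options shown are Linux-specific.

```python
import socket


def enable_keepalive(sock, idle=240, probes=10, interval=30):
    """Turn on TCP keepalive for sock (illustrative values, not recommendations)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These per-socket tunables exist on Linux; skip any the platform lacks.
    for name, value in (("TCP_KEEPIDLE", idle),
                        ("TCP_KEEPCNT", probes),
                        ("TCP_KEEPINTVL", interval)):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


def approx_detection_seconds(idle, probes, interval):
    """Rough upper bound on the time before a dead peer is noticed."""
    # Idle period before the first probe, then `probes` unanswered probes
    # spaced `interval` seconds apart.
    return idle + probes * interval
```

With the example values, a dead connection would be noticed after roughly 240 + 10 * 30 = 540 seconds, which is inside our 10-minute alert window. Once the socket errors out, I would expect the retry="60 +" logic to take over.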
Can somebody advise the best change to make here?