We have an application that uses LDAP for authentication. It is deployed on four middleware (MW) servers in a load-balanced configuration using F5. Although there are 8 domain controllers in the AD store, we get an error on all MW servers every day that seems to indicate Active Directory is unreachable. Users mostly report it during peak logon times, but it also happens at other times throughout the day. The text of the error message differs slightly between LDAPS and LDAP binds, but the stack trace is the same.
Error message:
- (LDAPS) - "The server is not operational. (0)"
- (LDAP) - "The directory service is unavailable. (0)"
We have tried the following configuration options, but the error persists:
- Connect to secure/non-secure DCs (port 636/389)
- Connect using server/server-less bind
After researching the problem, we are moving towards a code change that retries the bind operation once on each DC in the pool before returning an error to the user.
The way this would work is when the application starts, the Directory Service access component of the application would do the following:
- Build a list of all domain controllers in the site (and any adjacent sites, if preferred).
- Perform a series of tests to validate connectivity (ping, 389/3268/636 test). This would also confirm whether each server is a DC, GC, or RODC.
- Perform a simple query to validate that the directory service is functional and authentication is working.
- Save a list of the known good domain controllers, and also a list of the offline domain controllers.
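To make the startup checks concrete, here is a minimal sketch of the port test and the good/offline split. It is written in Python purely for illustration, since the application's own code is not shown here. The names `probe_port` and `classify_controllers` are hypothetical, and the sketch only checks that a TCP connection can be opened; it does not ping, bind, or run the validation query.

```python
import socket

# Ports mentioned above: 389 (LDAP), 636 (LDAPS), 3268 (global catalog).
LDAP_PORTS = (389, 636, 3268)


def probe_port(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


def classify_controllers(hosts, required_ports=(389,), timeout=2.0, probe=probe_port):
    """Split hosts into (known_good, offline) based on the required ports.

    `probe` is injectable so the logic can be exercised without a network.
    """
    known_good, offline = [], []
    for host in hosts:
        if all(probe(host, port, timeout) for port in required_ports):
            known_good.append(host)
        else:
            offline.append(host)
    return known_good, offline
```

A call such as `classify_controllers(dc_hosts, required_ports=(389, 636))` would return the two lists described in the last step, where `dc_hosts` is whatever list of DC host names the first step produces.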
We then use these known good servers when performing a bind, embedding the server in the bind path. If an exception occurs and is one of the types that would indicate a problem with the DC (server not operational, busy, timeout, etc.), we add that DC to the offline list and attempt the operation using one of the other DCs.
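The failover step could look roughly like the sketch below, again in Python for illustration only. `DcPool` and `DcUnavailableError` are hypothetical names; the exception stands in for whichever real error types the application treats as "DC problem", and `operation` is assumed to be a callable that performs the bind against the given server.

```python
class DcUnavailableError(Exception):
    """Hypothetical stand-in for 'server not operational', busy, timeout, etc."""


class DcPool:
    def __init__(self, known_good):
        self.known_good = list(known_good)
        self.offline = []

    def run(self, operation):
        """Call operation(dc) on each known good DC until one succeeds."""
        last_error = None
        for dc in list(self.known_good):  # iterate over a copy; the list is mutated below
            try:
                return operation(dc)
            except DcUnavailableError as exc:
                # Move the failing DC to the offline list and try the next one.
                self.known_good.remove(dc)
                self.offline.append(dc)
                last_error = exc
        raise DcUnavailableError("no domain controller available") from last_error
```

One trade-off this makes visible: a DC that fails once stays in the offline list until something re-tests it, so a periodic re-check would probably be needed to keep transient errors from shrinking the pool permanently.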
Is this approach a viable option? Are there any trade-offs? Would analyzing Wireshark captures help determine the root cause? Could the problem be related to TCP/UDP ports being unavailable on the MW servers?