Thanks for taking the time to check out my problem.
I'm currently working on an issue that has only appeared once before. Back on Jan 3rd, when this first appeared, we were able to reboot the server and everything seemed fine, but now it is back. This is a production database system, so finding a window to reboot can be difficult. I'm hoping to get a firm grasp on what is actually happening this time before we reboot again in a few days as another temporary fix. Here we go...
User authentication for the system in question is handled with LDAP via Red Hat Directory Server 9. The issue described below is only seen on this one server; even its counterpart that shares the database doesn't display the same symptoms. As of right now, no LDAP accounts are able to authenticate and log in to the server. LDAP auth is handled by SSSD, which currently cannot be stopped or restarted. When attempting to do either, the SSH console becomes unresponsive (Ctrl-C is unable to exit the issued command).
ps shows the usual sssd-related processes running, but attempting kill -9
on them doesn't stop any of them.
ps aux | grep sss | grep -v grep
root 1150 0.0 0.0 150828 2908 ? D 09:05 0:00 /usr/libexec/sssd/sssd_nss -d 0 --debug-to-files
root 7025 0.0 0.0 93616 2504 pts/2 D 16:18 0:00 /usr/sbin/sssd -f -D
root 11148 0.0 0.0 179436 5672 ? D Jan08 16:22 /usr/libexec/sssd/sssd_be -d 0 --debug-to-files --domain default
root 32700 0.0 0.0 150784 2908 ? D 10:10 0:00 /usr/libexec/sssd/sssd_pam -d 0 --debug-to-files
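For context, the D in the STAT column of the ps output above denotes uninterruptible sleep, and a process in that state does not act on signals (SIGKILL included) until the blocking kernel call returns, which may explain why kill -9 appears to do nothing. A minimal sketch for listing such processes together with the kernel function they are waiting in, assuming a procps ps that supports the wchan column. The function name d_state_procs is made up for illustration.

```shell
# Filter `ps` output on stdin down to processes in uninterruptible sleep.
# Assumes the column order PID STAT WCHAN COMMAND, i.e. the output of:
#   ps -eo pid,stat,wchan:32,args
d_state_procs() {
    # NR > 1 skips the header line; STAT values for this state start with D.
    awk 'NR > 1 && $2 ~ /^D/ { print }'
}

# Usage (left commented out so that sourcing this has no side effects):
# ps -eo pid,stat,wchan:32,args | d_state_procs
```

If every sssd process reports the same WCHAN value, that would hint at a single blocked resource (a filesystem or device, for example) rather than a fault inside SSSD itself, though that is only a guess at this point.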
Using strace getent -s sss passwd
I can see that some of the connection attempts are being refused, but I'm not really sure what to do about them.
connect(3, {sa_family=AF_FILE, path="/var/lib/sss/pipes/nss"...}, 110) = -1 ECONNREFUSED (Connection refused)
close(3) = 0
socket(PF_FILE, SOCK_STREAM, 0) = 3
fcntl(3, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
fcntl(3, F_GETFD) = 0
fcntl(3, F_SETFD, FD_CLOEXEC) = 0
connect(3, {sa_family=AF_FILE, path="/var/lib/sss/pipes/nss"...}, 110) = -1 ECONNREFUSED (Connection refused)
close(3) = 0
socket(PF_FILE, SOCK_STREAM, 0) = 3
fcntl(3, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
fcntl(3, F_GETFD) = 0
fcntl(3, F_SETFD, FD_CLOEXEC) = 0
connect(3, {sa_family=AF_FILE, path="/var/lib/sss/pipes/nss"...}, 110) = -1 ECONNREFUSED (Connection refused)
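ECONNREFUSED on a UNIX stream socket generally means the socket file exists but no process is listening on it. One way to check that, sketched below, is to look for listening sockets in /proc/net/unix. This assumes the usual field layout of that file (Flags in the 4th column, Path in the 8th) and that a Flags value of 00010000 marks a listening socket. The function name unix_listeners is hypothetical.

```shell
# Print the paths of listening UNIX sockets under a given path prefix.
# Reads /proc/net/unix-formatted text on stdin.
# Assumed columns: Num RefCount Protocol Flags Type St Inode Path
unix_listeners() {
    # $1 = path prefix to match, e.g. /var/lib/sss/pipes/
    awk -v p="$1" 'NR > 1 && $4 == "00010000" && index($8, p) == 1 { print $8 }'
}

# Usage (left commented out so that sourcing this has no side effects):
# unix_listeners /var/lib/sss/pipes/ < /proc/net/unix
```

On a healthy host this would be expected to print the nss and pam sockets. If it prints nothing for /var/lib/sss/pipes/nss, the refused connections in the strace output are consistent with a stale socket file.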
Checking lsof | head -n1; lsof | grep /var/lib/sss/pipes/
shows far fewer open pipes on the bad system than on the good one. The PIDs for these pipes are the same ones reported by ps aux
, so attempting kill -9
on them has been fruitless as well.
bad sssd
lsof | head -n1; lsof | grep /var/lib/sss/pipes/
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
sssd_be 11148 root 15u unix 0xffff8806635911c0 0t0 31817638 /var/lib/sss/pipes/private/sbus-dp_default.11148
sssd_be 11148 root 16u unix 0xffff880d443d6180 0t0 31783555 /var/lib/sss/pipes/private/sbus-dp_default.11148
sssd_be 11148 root 17u unix 0xffff880c536d94c0 0t0 31783560 /var/lib/sss/pipes/private/sbus-dp_default.11148
good sssd
lsof | head -n1; lsof | grep /var/lib/sss/pipes/
COMMAND PID USER FD TYPE DEVICE SIZE/OFF NODE NAME
sssd 26793 root 13u unix 0xffff88030b5d8c40 0t0 3248762734 /var/lib/sss/pipes/private/sbus-monitor
sssd 26793 root 14u unix 0xffff8808cc064bc0 0t0 3248762735 /var/lib/sss/pipes/private/sbus-monitor
sssd 26793 root 15u unix 0xffff880a9d9bc840 0t0 3248768164 /var/lib/sss/pipes/private/sbus-monitor
sssd 26793 root 16u unix 0xffff880040a32f00 0t0 3248768165 /var/lib/sss/pipes/private/sbus-monitor
sssd_be 26794 root 15u unix 0xffff8808cc064200 0t0 3248767368 /var/lib/sss/pipes/private/sbus-dp_default.26794
sssd_be 26794 root 16u unix 0xffff880a9d9bd880 0t0 3248763661 /var/lib/sss/pipes/private/sbus-dp_default.26794
sssd_be 26794 root 17u unix 0xffff8809841b4480 0t0 3248763662 /var/lib/sss/pipes/private/sbus-dp_default.26794
sssd_nss 26795 root 16u unix 0xffff880a9d9bd200 0t0 3248751954 /var/lib/sss/pipes/nss
sssd_pam 26796 root 16u unix 0xffff880859e26180 0t0 3248774325 /var/lib/sss/pipes/pam
sssd_pam 26796 root 17u unix 0xffff880859e27b80 0t0 3248774326 /var/lib/sss/pipes/private/pam
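To make the difference between the two listings easier to see, the lsof output can be condensed to one line per command and socket path. This is a rough sketch that assumes the socket path is the last field of each lsof line, as in the output above. The function name pipes_by_command is made up.

```shell
# Summarize lsof output on stdin as "<count> <command> <socket path>",
# keeping only sockets under /var/lib/sss/pipes/.
pipes_by_command() {
    awk '$NF ~ /^\/var\/lib\/sss\/pipes\// { n[$1 " " $NF]++ }
         END { for (k in n) print n[k], k }' | sort -k2
}

# Usage (left commented out so that sourcing this has no side effects):
# lsof | pipes_by_command
```

Running it on both hosts and comparing the results should show at a glance that the bad system is missing the sbus-monitor, nss, and pam sockets.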
Also, /var/log/secure contains multiple entries of
sshd[9177]: pam_succeed_if(sshd:auth): error retrieving information about user
su: pam_sss(su-l:session): Request to sssd failed. Connection refuse
crond[29568]: pam_sss(crond:session): Request to sssd failed. Connection refused
Additionally, one of the first things I noticed was that the /var/log/messages file contained no data. Both it and the /var/log/sssd/ logs seem to have stopped collecting around 9:03 this morning, while /var/log/secure kept accumulating data without issue. Restarting syslog fixed the issue for messages, but the sssd logs are still not functioning.
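Since /var/log/secure kept recording, it may be worth checking whether the first refused request lines up with the 9:03 cutoff. A small sketch that prints the first and last matching entries, assuming the message text "Request to sssd failed" as seen in the entries above. The function name first_last_sssd_failure is hypothetical.

```shell
# Print the first and last log lines on stdin that report a failed
# request to sssd. Prints nothing if there are no such lines.
first_last_sssd_failure() {
    awk '/Request to sssd failed/ {
             if (!seen) { first = $0; seen = 1 }
             last = $0
         }
         END { if (seen) { print "first: " first; print "last: " last } }'
}

# Usage (left commented out so that sourcing this has no side effects):
# first_last_sssd_failure < /var/log/secure
```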
The last thing I noticed is that dmesg is filled with messages like audit: backlog limit exceeded
audit: audit_backlog=322 > audit_backlog_limit=320
and audit_log_start: 122 callbacks suppressed
. I assumed these are from when syslog wasn't working properly, but I haven't verified that yet.
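To get a sense of how far the audit queue overshot its limit, the backlog values can be pulled out of dmesg. A minimal sketch, assuming the message format shown above. The function name max_audit_backlog is made up for illustration.

```shell
# Print the largest audit_backlog value found in kernel log text on stdin,
# or 0 if no such message is present.
max_audit_backlog() {
    awk 'match($0, /audit_backlog=[0-9]+/) {
             # Skip the 14 characters of "audit_backlog=" to get the number.
             v = substr($0, RSTART + 14, RLENGTH - 14) + 0
             if (v > max) max = v
         }
         END { print max + 0 }'
}

# Usage (left commented out so that sourcing this has no side effects):
# dmesg | max_audit_backlog
```

Running auditctl -s as root reports the current backlog limit, and auditctl -b sets a new one. Whether raising it is appropriate here is an open question, since the backlog may be a symptom of auditd itself being blocked rather than the cause.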
I'm still researching this and hope I'll find something, but I would more than welcome any suggestions and feedback people are willing to provide.
Thanks a lot!
-Omni