
When our gMSA accounts are automatically rotated, we see login failures for roughly 1 to 10 minutes. This is particularly apparent for gMSA client accounts that connect to SQL Server, but I think it happens for other gMSA accounts as well. SQL Server itself is not running as a gMSA account; our application uses a gMSA to make a client connection to SQL Server. ManagedPasswordIntervalInDays defaults to 30 days, so we see this every month at the same time.

When I check the domain controller logs, I don't see any login failures for the gMSA user, but SQL Server logs the following error:

SSPI handshake failed with error code 0x8009030c, state 14 while establishing a connection with integrated security; the connection has been closed. Reason: AcceptSecurityContext failed. The operating system error code indicates the cause of failure. The logon attempt failed [CLIENT: x.x.x.x]

From what I have found, this error usually indicates the wrong username/password combination.

This occurs on multiple clients, and each eventually starts connecting again after anywhere from 1-10 minutes. The clients don't all start connecting at the same time, but it seems to be randomly within that time window.

Initially I thought it might be related to AD replication of the changed password, so we changed the default inter-site replication setting to USE_NOTIFY so that changes replicate immediately. If replication were the issue, I would expect to see logon failures on the DCs, and I'm not seeing any. I had also thought that maybe SQL Server was caching the authentication token, but if that were the case, I would expect all clients to recover at the same time (i.e. when SQL Server refreshed). Since each client starts working again at a different time, the cause doesn't appear to be on the SQL Server side, but more likely something on the client side: maybe caching of the gMSA password, or something related to timeouts and retry backoffs.

devons
  • Time drift? Are the clients all synchronized via NTP? – AlwaysLearning Dec 02 '20 at 08:43
  • Good point, yes we are syncing time via NTP, and I've confirmed that the affected servers are all within milliseconds of each other. –  Dec 03 '20 at 13:21
  • I would check if the client's Kerberos ticket needs to be renewed when the gMSA password is changed. Test that with `klist purge` from the client when it fails connection. – Hannah Vernon Dec 15 '20 at 20:56
  • or klist purge when you refresh the gMSAs on a server – jcolebrand Dec 15 '20 at 20:56
  • `klist purge` and Kerberos tickets are good ideas, as it does seem to be related to the ticket being cached on the client. I did try purging the cache for all users with: `Get-WmiObject Win32_LogonSession | Where-Object {$_.AuthenticationPackage -ne 'NTLM'} | ForEach-Object {klist.exe purge -li ([Convert]::ToString($_.LogonId, 16))}` This seems to have improved things, but I still see some errors, so it doesn't seem to be the entire solution. – devons Jan 04 '21 at 20:05

2 Answers


I found that this was due to the way the Windows service was configured. It was set up as a standard service running under a regular user account, which happened to be a gMSA account, rather than as a service using a managed account.

This can be verified with:

>sc.exe qmanagedaccount ServiceName

[SC] QueryServiceConfig2 SUCCESS

ACCOUNT MANAGED : FALSE

This can be changed by running:

sc.exe managedaccount ServiceName TRUE

After changing the Windows Service account type to be managed, initial testing shows that logins are now successful during the gMSA password rotation.
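If several services need checking, the `sc.exe qmanagedaccount` output shown above can be parsed in a script. This is only a minimal sketch: the function names are made up for illustration, and it assumes the output always contains an `ACCOUNT MANAGED : TRUE|FALSE` line in the format quoted in this answer.

```python
import re
import subprocess

# Sample text copied from the sc.exe output shown in this answer.
SAMPLE_OUTPUT = """[SC] QueryServiceConfig2 SUCCESS

ACCOUNT MANAGED : FALSE
"""


def parse_managed_account(sc_output: str) -> bool:
    """Return True if the output reports ACCOUNT MANAGED : TRUE.

    Assumes the line format shown in the answer; raises ValueError
    if no such line is present (for example, when the query failed).
    """
    match = re.search(r"ACCOUNT MANAGED\s*:\s*(TRUE|FALSE)", sc_output, re.IGNORECASE)
    if match is None:
        raise ValueError("no ACCOUNT MANAGED line found in sc.exe output")
    return match.group(1).upper() == "TRUE"


def query_managed_account(service_name: str) -> bool:
    """Run `sc.exe qmanagedaccount <service>` (Windows only) and parse the result."""
    result = subprocess.run(
        ["sc.exe", "qmanagedaccount", service_name],
        capture_output=True,
        text=True,
    )
    return parse_managed_account(result.stdout)


if __name__ == "__main__":
    # Parses the sample text only; call query_managed_account("ServiceName")
    # on a Windows host to check a real service.
    print(parse_managed_account(SAMPLE_OUTPUT))
```

Services that report FALSE are the ones that would be candidates for `sc.exe managedaccount ServiceName TRUE`.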

devons
  • I know the problem went away for you after this, but why do you think this fixed the issue? The docs for `sc.exe managedaccount` indicate that this tells SCM to query for the password on service start and not while it's running. It also seems to me that setting the password to NULL does the same thing. We're having the same problem with our setup and will try this, but it doesn't make much sense to me without some explanation. – Ritch Melton Mar 16 '22 at 10:47
  • This initially seemed to improve things, but it's not a fix. We are still seeing the issue, and I don't have a solution yet. After further research, I found that gMSA accounts have a 5-minute window where both the old password and the new password are accepted. We don't see any errors at the moment the password is rotated; the errors start 5 minutes after the rotation, when that window closes. – devons Mar 17 '22 at 12:28
  • The docs indicate that SCM saves the old password as 'backup' and attempts it if the new one doesn't work, but it's not really clear how often it fetches the password for the gMSA. My suspicion is that the password change refresh is abstracted away by LSASS, so that from SCM's perspective it's no different from a local managed account like LocalSystem or a virtual account, and that the NULL password trick tells SCM to not cache the password (or not cache it for long) in order to pull it from LSASS. Again, wild guesses here, but LocalSystem, et al., have been around for a while without issue. – Ritch Melton Mar 22 '22 at 11:02

We were generating the same error because of an SPN issue that caused the gMSA to authenticate to SQL Server via NTLM instead of Kerberos. If you log into SQL Server and check the sessions via sys.dm_exec_connections, you should see a list of sessions whose auth_scheme is NTLM.

[Screenshot: NTLM Sessions]

(You can also use `klist sessions` from the command line to view the sessions.)

We were able to correlate our errors with the password changes using log analytics tools, so we knew rotation was the trigger. I do not know how often the SCM refreshes its copy of the password, but if the service authenticates to SQL Server using Kerberos, then I believe password changes should be independent of the Kerberos session lifetime/renewal. The generated error is therefore a solid clue that the service is authenticating to the SQL Server host via NTLM. Once we fixed our SPN issue (which was due to an additional DNS A record), the sessions switched over to Kerberos authentication.
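To spot which logins are falling back to NTLM, the sys.dm_exec_connections check can be scripted. The following is a rough sketch, not something we ran in production: the query joins to sys.dm_exec_sessions to pick up login_name, `fetch_rows` assumes any DB-API style cursor (for example one obtained from an ODBC driver), and the sample rows, account names, and addresses are hypothetical.

```python
from collections import Counter

# Lists each connection with its login and authentication scheme.
AUTH_SCHEME_QUERY = """
SELECT c.session_id, s.login_name, c.auth_scheme, c.client_net_address
FROM sys.dm_exec_connections AS c
JOIN sys.dm_exec_sessions AS s ON s.session_id = c.session_id
"""


def fetch_rows(cursor):
    """Run the query on a DB-API style cursor (assumed; driver not shown)."""
    cursor.execute(AUTH_SCHEME_QUERY)
    return cursor.fetchall()


def ntlm_logins(rows):
    """Count NTLM connections per login.

    rows: iterable of (session_id, login_name, auth_scheme, client_net_address).
    """
    counts = Counter()
    for _session_id, login_name, auth_scheme, _client_addr in rows:
        if auth_scheme.upper() == "NTLM":
            counts[login_name] += 1
    return dict(counts)


# Hypothetical rows for illustration only.
sample_rows = [
    (51, "CONTOSO\\app-gmsa$", "NTLM", "10.0.0.5"),
    (52, "CONTOSO\\app-gmsa$", "KERBEROS", "10.0.0.6"),
    (53, "CONTOSO\\app-gmsa$", "NTLM", "10.0.0.5"),
    (54, "CONTOSO\\dba", "KERBEROS", "10.0.0.9"),
]

if __name__ == "__main__":
    print(ntlm_logins(sample_rows))
```

Any gMSA that shows up in the result is connecting over NTLM, which is where I would start looking for a missing or mismatched SPN.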

Ritch Melton
  • Thanks for the insight. We definitely had missing SPN records for some of the SQL servers. That has recently been fixed as part of another effort. I'll have to wait till the next rotation to see if it makes a difference. – devons Mar 18 '22 at 12:29
  • I think this is the correct solution, but we're waiting for the next rotation as well. – Ritch Melton Mar 22 '22 at 10:58
  • After waiting for the next gMSA password rotation, we are no longer seeing errors around rotation. Solution: Our SQL servers had Always On listeners which did not have proper SPN records registered. This forced connections to use NTLM auth instead of Kerberos. Once the proper SPN records were registered in AD, connections use Kerberos, and we don't get errors during rotation. – devons Apr 15 '22 at 13:06