
I have a pair of Windows Server 2003 machines, both running MS SQL Server 2008 EE, that each hang every few months. This has been happening intermittently :( for the past 15 months, pretty much since we started using the servers.

The symptoms are as follows:

  • I cannot remote desktop in to troubleshoot; when I attempt to, I get stuck on a blank black screen and am never offered a login prompt
  • I can still ping the servers
  • I can still open a SQL connection to the server, and, CURIOUSLY/BIZARRELY, when I do a "select getdate()", the time it returns appears to be stuck on the exact fraction of a second when (I presume) the server hung. Repeated attempts to do "select getdate()" keep getting that same date, suggesting that the clock is frozen.
  • Filesharing attempts to connect to the hung server fail with the error message: "\\ServerName is not accessible. You might not have permissions to use this network resource. Contact the administrator of this server to find out if you have access permissions. The server's clock is not synchronized with the primary domain controller's clock." This is consistent with a frozen clock.
  • Post-reboot, if I investigate the Windows Event Viewer logs, I can see many security accesses (coming from me and others) that I recognize were login attempts during the "down" period, but all of them in the security log are associated with that same timestamp of when the server hung. This also suggests the clock is frozen. There is not a clear cause in the Application or System event logs.
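The repeated "select getdate()" check described above can be scripted so that it runs unattended. The following is only a minimal sketch: `read_clock` is a hypothetical callable that would wrap whatever database driver is in use and return the server's current time, and `sample_clock` and `looks_frozen` are illustrative names, not part of any library.

```python
import time
from datetime import datetime
from typing import Callable, List


def sample_clock(read_clock: Callable[[], datetime],
                 samples: int = 5,
                 delay_s: float = 1.0) -> List[datetime]:
    """Call read_clock() several times, pausing delay_s seconds between calls."""
    readings = []
    for i in range(samples):
        readings.append(read_clock())
        if i < samples - 1:
            time.sleep(delay_s)
    return readings


def looks_frozen(readings: List[datetime]) -> bool:
    """True if at least two readings were taken and they are all identical."""
    return len(readings) >= 2 and len(set(readings)) == 1
```

With a real delay between samples, a healthy server should return distinct timestamps, so `looks_frozen(sample_clock(read_clock))` returning True would match the symptom reported here.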

I have a local Admin account on the server and am in the process of getting a domain-credentialed Admin account for better remote admin access.

HP is supposed to be supporting these machines and has some low-level ILO2 access but they seem incapable of finding the root cause.

A reboot will "fix" the problem but I would like to get to the root cause and solve the issue. Has anyone ever seen something like this odd clock behavior?! (If it were just one server I'd perhaps say a bad hardware clock, but two?) Can anyone advise me on what I should try to troubleshoot this sort of situation to find the root cause (or what I should tell HP to try?)

GregW
  • Does this happen at the same time? Are the two sql servers a cluster, sharing a SAN? I've seen stuff similar to this, but not with all the mentioned issues, if the disk usage far exceeded the capability of the disk. Nothing else in the event log? – Nixphoe Apr 27 '11 at 13:39
  • Can you please provide information about the level of updates and service pack? Have you tried installing the latest Microsoft patches? Have you tried upgrading the BIOS, chipset and NIC firmware, drivers, etc. from HP's site? – Vick Vega Apr 27 '11 at 16:43
  • @Nixphoe: The two don't hang at the same time. And the two sql servers are not a cluster nor do they share a SAN. Disk usage does not appear to be a culprit. I've had ~8 clock-hangs over ~18 months and the event logs don't show a clear consistent culprit message occurring as/right-before each hang. – GregW May 03 '11 at 09:56

2 Answers


As Nixphoe has pointed out - Event Logs, Event Logs, Event Logs would be the first place to look.

It does "sound" like you may have some kind of memory leak caused by something the two servers have in common, whether an installed application or a configuration setting. There are many resources available on tracking memory usage. You may need to track it over time to identify the offending application or condition.

user48838
  • Thanks. I've had ~8 clock-hangs over ~18 months and the event logs don't show a clear, consistent culprit message occurring at or right before each hang. I too am somewhat suspicious of a memory leak or memory pressure problem, since the issue sometimes, but not always, occurs around the time of a nightly memory-intensive ETL operation (but I have no idea how that would hang a system clock!) – GregW May 03 '11 at 10:02
  • Is it truly a clock hang or did the OS hang completely due to lack of available memory or some other faulty interaction/situation? How is it showing as a clock hang? Can you chart memory usage/availability over time? – user48838 May 04 '11 at 12:45
  • The signs of clock-hang are three-fold: – GregW May 08 '11 at 10:34
  • The signs of clock-hang are three-fold: 1) query connections to SQL server which is still running on the box keep showing the same getdate() when queried repeatedly 2) after a reboot, the windows security event log shows tens or hundreds of login and other attempted events showing at the same time (but I know firsthand they occurred during the hours-long hung period) 3) when I try to open a filesharing connection to the \\ UNC path on the 'hung' server, I get an access denied message, with the reason being "The server's clock is not synchronized with the primary domain controller's clock". – GregW May 08 '11 at 10:41
  • I don't know whether the root cause is the clock hang or whether that is merely a symptom of memory pressure or some other issue. Sadly, I am not sure how to capture memory availability over time; can one set up a months-long perfmon job to do that (one that keeps running when I disconnect from the box), or would you suggest something else? – GregW May 08 '11 at 10:58
  • You are on the right track. The performance counters may be the simplest option, as they are built into Windows for just this type of situation. – user48838 May 08 '11 at 17:21
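As an alternative to a perfmon counter log, memory availability can also be recorded by a small script left running on the box. The sketch below is an assumption-laden illustration, not the approach anyone in this thread used: it reads available physical memory through the Win32 `GlobalMemoryStatusEx` call via `ctypes` and appends one row per sample to a CSV file. The names `available_physical_bytes`, `log_sample`, `run_forever` and the file name `memlog.csv` are hypothetical.

```python
import csv
import ctypes
import os
import time
from datetime import datetime


class MEMORYSTATUSEX(ctypes.Structure):
    """Layout of the Win32 MEMORYSTATUSEX structure."""
    _fields_ = [
        ("dwLength", ctypes.c_ulong),
        ("dwMemoryLoad", ctypes.c_ulong),
        ("ullTotalPhys", ctypes.c_ulonglong),
        ("ullAvailPhys", ctypes.c_ulonglong),
        ("ullTotalPageFile", ctypes.c_ulonglong),
        ("ullAvailPageFile", ctypes.c_ulonglong),
        ("ullTotalVirtual", ctypes.c_ulonglong),
        ("ullAvailVirtual", ctypes.c_ulonglong),
        ("ullAvailExtendedVirtual", ctypes.c_ulonglong),
    ]


def available_physical_bytes() -> int:
    """Return available physical memory in bytes (Windows only)."""
    stat = MEMORYSTATUSEX()
    stat.dwLength = ctypes.sizeof(MEMORYSTATUSEX)
    if not ctypes.windll.kernel32.GlobalMemoryStatusEx(ctypes.byref(stat)):
        raise OSError("GlobalMemoryStatusEx failed")
    return stat.ullAvailPhys


def log_sample(path, read_bytes=available_physical_bytes, now=datetime.now):
    """Append one (timestamp, available_bytes) row, writing a header first if the file is new."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="") as f:
        writer = csv.writer(f)
        if is_new:
            writer.writerow(["timestamp", "available_bytes"])
        writer.writerow([now().isoformat(), read_bytes()])


def run_forever(path="memlog.csv", interval_s=60):
    """Sample once per interval until the process is stopped."""
    while True:
        log_sample(path)
        time.sleep(interval_s)
```

To survive a logoff, something like `run_forever()` would have to be started as a scheduled task or service rather than from an interactive session. Because each row is written and the file closed immediately, the last row before a hang should still be on disk after the reboot.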

User48838 is right. It sounds like a memory leak.

For detecting memory leaks, check out this article from Microsoft: http://technet.microsoft.com/en-us/library/cc938582.aspx This explains exactly what you have to look at in terms of performance counters.
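Once counter samples have been collected, one rough way to spot a leak is to look for a steady downward trend in available memory. The sketch below fits a least-squares line to (time, available bytes) pairs; it is a generic illustration rather than the method from the linked article, and the function name `trend_slope` and the sample values are made up.

```python
from typing import Sequence, Tuple


def trend_slope(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of y over x for (x, y) pairs, e.g. (seconds, available bytes)."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    sum_x = sum(x for x, _ in samples)
    sum_y = sum(y for _, y in samples)
    sum_xx = sum(x * x for x, _ in samples)
    sum_xy = sum(x * y for x, y in samples)
    denom = n * sum_xx - sum_x * sum_x
    if denom == 0:
        raise ValueError("all x values are identical")
    return (n * sum_xy - sum_x * sum_y) / denom


# Hypothetical samples: available memory falling by 10 units per time step.
print(trend_slope([(0, 100), (1, 90), (2, 80)]))  # -10.0
```

A slope that stays negative across many hours of samples, and does not recover after the nightly ETL job finishes, would be consistent with a leak; a slope near zero would point elsewhere.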

Also, there is a very useful tool from Microsoft, Debug Diagnostic Tool. I have used it a few times and it really does the job. Here are some instructions on how to use it.

Can you give us more details about the server? Specs, NICs, OS service pack and bitness (32-bit or 64-bit), etc.? I know that there was a problem with Win 2k3 + SQL 2008 on HP ProLiant servers which resulted in clock drift or an unresponsive server. I am not sure whether that applies in this case because I don't have sufficient details, but here is the article from Microsoft just in case: http://support.microsoft.com/kb/2022911

I hope this helps.

Alex