I have a Windows 2008R2 server running NSClient++. For some reason the service has got its knickers in a twist and stopped responding to Nagios polling.

When I try to restart the service, the service manager spends a long time trying to stop it, then eventually gives up with a message along the lines of "the service took too long to respond". But... it also starts a new instance of the service.

If I look in Task Manager or tasklist I can now see two instances of nsclient++.exe running.

I tried to kill both of these using:

  • right click and "End Process" in task manager - pretends to kill the process and reports no errors (for example Access Denied) but the process is still there.

  • taskkill /PID <proc id> /F - reports SUCCESS: The process with PID 6672 has been terminated. but the process is still running.

  • downloaded SysInternals PsTools and ran pskill <PID> - reports Process <PID> killed - yet the process is still there.

  • scheduled at hh:mm pskill <PID> so that pskill runs as the SYSTEM account ... and, you guessed it, the process is still running.

All of the above were run in an Administrator command prompt.
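
For anyone who wants to script the "did it really die?" check rather than eyeball Task Manager, here is a rough Python sketch. It assumes `tasklist /FI "PID eq <pid>" /FO CSV /NH` output, where each data row has the PID in the second column. The function names `pid_in_tasklist_csv` and `still_running` are made up for illustration, and PID 6672 is just the one from the taskkill message above.

```python
import csv
import io
import subprocess


def pid_in_tasklist_csv(output: str, pid: int) -> bool:
    """Return True if `pid` appears in the PID column of tasklist CSV output."""
    for row in csv.reader(io.StringIO(output)):
        # Assumed data row shape: "image.exe","6672","Services","0","12,345 K"
        # The "INFO: No tasks are running..." message parses as a one-field
        # row, so it never matches.
        if len(row) >= 2 and row[1].strip() == str(pid):
            return True
    return False


def still_running(pid: int) -> bool:
    """Ask tasklist (Windows only) whether `pid` is still listed."""
    result = subprocess.run(
        ["tasklist", "/FI", f"PID eq {pid}", "/FO", "CSV", "/NH"],
        capture_output=True,
        text=True,
        check=False,
    )
    return pid_in_tasklist_csv(result.stdout, pid)
```

Calling `still_running(6672)` right after taskkill reports SUCCESS would show whether the process is genuinely gone or only claimed to be.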

Other than a reboot which is not really ideal (the box is a fairly mission critical production server), what else can I try?

The server isn't under any resource pressure (memory, CPU, disk etc) and everything running on it is chugging along just fine.

A quick look at the Threads tab in SysInternals Process Explorer shows that all of these nsclient++.exe instances are stuck unloading:

[Screenshot: Process Explorer Threads tab for the nsclient++.exe processes]

As an aside, I also tried killing all of the TCP connections for these zombie(?) processes (with TCPView) in the hope that I could start a new instance and it would be able to grab port 5666. Then we could reboot the server when things are quieter, but alas that didn't work.
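
To confirm from a script which process is still holding the listener, something like the following Python sketch could be used. It assumes the usual five-column TCP rows printed by `netstat -ano` (Proto, Local Address, Foreign Address, State, PID). The helper name `pids_listening_on` is hypothetical.

```python
from typing import Set


def pids_listening_on(netstat_output: str, port: int) -> Set[int]:
    """Collect PIDs that have a TCP socket in LISTENING state on `port`."""
    pids = set()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Assumed TCP row: Proto, Local Address, Foreign Address, State, PID.
        # Header lines and UDP rows have a different field count and are skipped.
        if len(fields) != 5 or fields[0] != "TCP" or fields[3] != "LISTENING":
            continue
        # The port is whatever follows the last colon (works for [::]:5666 too).
        local_port = fields[1].rsplit(":", 1)[-1]
        if local_port == str(port) and fields[4].isdigit():
            pids.add(int(fields[4]))
    return pids
```

Feeding it the captured text of `netstat -ano` with `port=5666` returns the set of owning PIDs, which can then be compared against the PIDs of the stuck nsclient++.exe instances.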

Kev
  • If a process will not kill with Task Manager then it's actually stuck in a kernel routine... So Windows is having problems. Do you have any "interesting" drivers installed? – Chris S Aug 24 '12 at 15:29
  • There's nothing really exotic running driver-wise. It's a XenServer VM, so it has the usual Xen drivers, which we generally don't have trouble with. We also run R1 CDP Enterprise and that seems to be operating within our normal operating parameters. I added a screenshot showing the Threads tab from procexp.exe. – Kev Aug 24 '12 at 15:46
  • If you click on `Stack`, what does the stack look like for the stuck threads? – HeatfanJohn Aug 24 '12 at 16:06
  • @HeatfanJohn - I thought of that too but get an error *"Error accessing thread"* when I do that. – Kev Aug 24 '12 at 16:10
  • My guess is that is related to @ChrisS' comment on being stuck in a kernel routine. – HeatfanJohn Aug 24 '12 at 16:17
  • Yeah, looks like a reboot tonight. – Kev Aug 24 '12 at 16:22

1 Answer

Even though it seems you've figured this out already, the problem is that the process is waiting on the kernel for something. (This is usually a driver-level problem, but not always.) The only way to kill such a process is to unload the kernel, which, of course, you can't do without rebooting.

Might be worth trying some kernel debugging (does this tool work on 2008 R2?) in the hopes of narrowing down the specific cause or conflict, but your options for handling the problem are either living with it, or rebooting the server to eliminate it.

Is there a reason you haven't considered living with it? If it's just a zombie process, and it's not impacting anything, I'd think you could put off a reboot until a maintenance window or more opportune time. That's typically my approach when the zombie or hung process isn't interfering with anything: take care of it during the next patch cycle or scheduled maintenance window.

HopelessN00b
  • Sadly too late to examine these processes in WinDbg, the infrastructure guys have rebooted the server. But handy to know for next time. – Kev Aug 24 '12 at 19:57
  • The other problem was that we couldn't live with it like this. The service is NSClient++, which we use in conjunction with Nagios. I couldn't even get a fresh service exe to run and respond to polling requests, I think because these zombied processes were still hanging onto port 5666, which it listens on (I could certainly see one of them still holding the port in TCPView, and I couldn't close it). – Kev Aug 24 '12 at 20:00
  • Well, that's certainly a very good reason not to live with it. – HopelessN00b Aug 24 '12 at 20:10
  • If it happens again, don't forget another one of Mark Russinovich's babies - Process Monitor. Point procmon at the process to see what it's doing. Wonderful tool. – Simon Catlin Aug 24 '12 at 22:52
  • @SimonCatlin - aye, I did that too but nothing really jumped out at me. – Kev Aug 24 '12 at 23:14