11

In the last few months I've received few reports from QA about one of our services hanging. Upon examining a hang dump using WinDbg, every time I discovered the same thing: Loader lock critical section is locked but owning thread is nowhere to be found. Since the thread is gone and the only trace that I can see is a global critical section it left behind, I don't see what code ran on thread thread, or even what DLL that thread came from, it may not even be one of ours (i.e. third party vendor).

This issue is very sporadic, only seen it maybe 3-4 times over the last 6 months occurring naturally in the wild. All other times, service runs perfectly. So this makes me believe it's some kind of timing/race condition thing.

Recently, I've decided to take it upon myself to figure this one out. I setup a machine with WinTask script that constantly starts/stops the said service. Good news is that within 5-6 hours I can reproduce the problem.

Now for next part: how do I isolate it?

This is what I've tried so far:

  1. used "debugger" field in gflags image settings to automagically run the service under cdb whenever it starts. So far this has been running for two days and never hung, so I'm thinking debugger introduced just enough of a timing change to make the issue invisible.

  2. Downloaded Application Verifier and configured the process to run with that. Found a completely unrelated bug where we create CComBSTR temporary variable, assign it to a VARIANT and pass the variant into a function call even though CComBSTR long deleted the allocated string by that point. Don't believe this bug is related because string is read-only and the thread it's running on isn't the one that's dying.

I'm making this post in case you guys could think of something that I'm not considering.

I though there was a windows utility that artificially put load on the CPU and did other things to make race conditions pop up and I thought application verifier did such a thing, but apparently it doesn't. Does anyone know what I'm taking about, or did I just dream that up?

Unless something happens over the weekend my next step would be to disable all debuggers, go back to stock and hack one of DllMains to record THREAD_ATTACH/THREAD_DETACH events. At least I'll be able to intercept the thread that's dying when it gets created. That might shed some light.

DXM
  • 4,413
  • 1
  • 19
  • 29
  • 1
    -1?? why?? did I not show enough detail? does it look I didn't do enough research? do people not ask stackoverflow when they get stumped by software development problems? – DXM Jan 15 '12 at 06:39
  • Yes, this is a perfectly valid question. The only thing that would make it better would be posting some code. I suppose that's why someone drive-by downvoted. – Cody Gray - on strike Jan 15 '12 at 07:46
  • 2
    It's a production app that's been on the market for more than 10 years. I don't even know what DLL is causing the issue, let alone posting closed-source company code online, even if I could copy/paste 5M lines of source code. Have no idea what changed, but first time I saw the problem about 6 months ago. – DXM Jan 15 '12 at 08:03
  • 1
    Don't worry too much about downvoters. I have the impression that some people downvote/try to close any question they don't know the answer to... Good luck in your search, hopefully someone will come with some hints, though the problem is *broad* :) – Matthieu M. Jan 15 '12 at 13:58

4 Answers4

2

Something I might try is attaching a kernel debugger, then run the process under Appilcation Verifier. AV has checks for unloading a DLL while it holds a CS and terminating threads that still hold CS. So those breakpoints should trigger in the kernel debugger and then hopefully you can catch it in the act. Running it under the KD hopefully won't slow it down like the user-mode debugger does.

jcopenha
  • 3,935
  • 1
  • 17
  • 15
  • kernel debugger... interesting. That's a whole realm I've never ventured into. Might be an option if all else fails. This way I can have the debugger running and monitoring things while service starts and stops. I'll need a second machine for this, right? Do I also need a serial cable or do they have other means these days? – DXM Jan 15 '12 at 20:00
  • the easiest way is in a virtual machine, and then the VM exports the COM port as a named pipe that windbg knows how to connect to. – jcopenha Jan 15 '12 at 20:11
  • I'm trying not to breath on this machine more than I have to. VM might scare this bug off just like cdb did. So assuming I leave it where it is, serial cable, right? – DXM Jan 15 '12 at 20:16
  • in that case yes, serial cable, 1394, or USB but you'll want to google for some good instructions as all of my experience has been VM based. – jcopenha Jan 16 '12 at 01:58
  • Although I didn't get to trying what you suggested, you definitely presented me an alternative I haven't considered. Adding one more to the toolbox. Thank you. – DXM Jan 17 '12 at 02:45
1

So turns out I was closer to the solution than I realized. With the service running under cdb, which altered the timing and then running it with application verifier, which altered the timing even more (page heap enabled makes allocation slower), the secret ingredient I was missing was prime95.exe. Running prime95.exe at above normal priority, really screwed up whatever timing I was trying not to change, but it made the problem show up in under 15 minutes.

The cause:

3rd party SDK for acquiring data from hardware boards. When our service starts up, we would query different capture components for their capabilities. After the query is done, we release the component instance. Apparently this one DLL started a separate thread, which acquired a loader lock and then proceeded to do a bunch of initialization in that thread. If during that time, our capability query got done and we released the component, their code would call TerminateThread() on this other thread leaving the loader lock permanently locked. Prime95 slowed everything down just enough for me to catch this race condition and get the following verifier stop message:

=======================================
VERIFIER STOP 00000200: pid 0x1A8C: Thread cannot own a critical section. 

0000091C : Thread ID.
77E17340 : Critical section address.
00000000 : Critical section debug information address.
00000000 : Critical section initialization stack trace.

Funny part is that this thread was "disappearing" without any exception of any kind, so debugger wouldn't even catch first chance anything. Who uses TerminateThread????

Thank you, everyone for suggestions and support. I was actually starting to look forward to driving to Radioshack during lunch to buy a serial cable and then spending a few days playing with KD. Looks like that will have to wait till next time :)

DXM
  • 4,413
  • 1
  • 19
  • 29
0

Some random ideas: If attaching a debugger doesn't help, then instrumentation (your last point) is the next step. But how can a thread just die without bringing down the whole process, are you catching exceptions somewhere? You might want to be logging there as well. You can also set WinDbg to break on all first-chance exceptions, if that helps. The WinDbg output window will show first-chance exceptions anyways even if you don't break.

Chris O
  • 5,017
  • 3
  • 35
  • 42
  • it only happens on start up and very infrequently so I can't have windbg attached. But this is exactly what I'm trying to achieve with cdb, which is a console version of the same debugger engine. We definitely have logging in threads we control, but I know other APIs/DLLs start their own threads and I have no idea what those are doing. Fun stuff. – DXM Jan 15 '12 at 19:58
0

I would try a non-invasive debugger, and see how that goes, while you won't be able to stop the process, you should be able to see any debugging messages as well as any threads that start and stop, and it should have minimal impact on process performance. I usually use windbg for my debugging, but I think cbd has similar options as well. This will most likely let you see what's happening in the process, and at least start helping to narrow it down. One thing you might want to make sure to do is to redirect the output (.logopen in windbg) to make sure that nothing goes outside of your buffer.

Zipper
  • 7,034
  • 8
  • 49
  • 66
  • what do you mean by non-invasive debugger? WinDbg and cdb is essentially the same debugger. WinDbg has UI, cdb is console, but in all other respects they are identical. I had my service start under cdb which opened a tcp port so I could use windbg to connect to it later. Also, as you mentioned, I used .logopen to make sure whatever cdb sees, I could also see through a log file. – DXM Jan 17 '12 at 02:29
  • use the -pv command for cdb to attach non-invasively. You won't be able to stop the threads or do any real actions, but you should be able to see more diagnostic information and hopefully not cause any performance issues that prevent the issue from occurring. Here's a link on non-invasive debug http://msdn.microsoft.com/en-us/library/ff552274(v=vs.85).aspx – Zipper Jan 17 '12 at 04:17