In the last few months I've received a few reports from QA about one of our services hanging. Every time I examined a hang dump in WinDbg I found the same thing: the loader lock critical section is locked, but the owning thread is nowhere to be found. Since the thread is gone and the only trace it left behind is a held global critical section, I can't see what code ran on that thread, or even which DLL it came from. It may not even be one of ours (i.e. it could belong to a third-party vendor).
This issue is very sporadic: I've seen it occur naturally maybe 3-4 times over the last 6 months. All other times the service runs perfectly. This makes me believe it's some kind of timing/race condition.
Recently I decided to take it upon myself to figure this one out. I set up a machine with a WinTask script that constantly starts and stops the service. The good news is that I can reproduce the problem within 5-6 hours.
Now for the next part: how do I isolate it?
This is what I've tried so far:
Used the "debugger" field in the gflags image settings to automagically run the service under cdb whenever it starts. So far this has been running for two days without a hang, so I'm thinking the debugger introduced just enough of a timing change to make the issue invisible.
Downloaded Application Verifier and configured the process to run with it. That found a completely unrelated bug where we create a temporary CComBSTR, assign it to a VARIANT and pass the VARIANT into a function call, even though the CComBSTR has long since freed the allocated string by that point. I don't believe this bug is related, because the string is only ever read and the thread it runs on isn't the one that's dying.
I'm making this post in case you guys could think of something that I'm not considering.
I thought there was a Windows utility that artificially puts load on the CPU and does other things to make race conditions pop up, and I thought Application Verifier did that, but apparently it doesn't. Does anyone know what I'm talking about, or did I just dream that up?
Unless something happens over the weekend, my next step is to disable all debuggers, go back to the stock configuration and hack one of our DllMain functions to record DLL_THREAD_ATTACH/DLL_THREAD_DETACH events. That way I'll at least be able to intercept the dying thread when it gets created, which might shed some light.
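For what it's worth, here is a minimal sketch of what that DllMain hack could look like. Every name in it (ThreadEvent, ThreadRecord, g_records, g_next, RecordThreadEvent, the buffer size) is made up for illustration and none of it comes from our actual code. Because DllMain runs with the loader lock held, the recorder deliberately does no allocation, locking or I/O; it just writes into a fixed in-memory ring buffer that can be inspected from a dump later.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical names throughout; this is an illustration, not the service's code.
enum class ThreadEvent : std::uint32_t { Attach = 1, Detach = 2 };

struct ThreadRecord {
    std::uint32_t threadId;  // OS thread id at the time of the notification
    ThreadEvent   event;     // attach or detach
    std::uint64_t sequence;  // global order in which events were recorded
};

// Fixed-size ring buffer: old entries are overwritten once it wraps.
constexpr std::size_t kCapacity = 4096;
static ThreadRecord g_records[kCapacity];
static std::atomic<std::uint64_t> g_next{0};

// Safe to call under the loader lock: no heap, no locks, no calls into other DLLs.
// Thread notifications are serialized by the loader lock, so the plain slot
// writes below do not race with each other when called from DllMain.
void RecordThreadEvent(std::uint32_t threadId, ThreadEvent event) {
    const std::uint64_t seq = g_next.fetch_add(1, std::memory_order_relaxed);
    ThreadRecord& slot = g_records[seq % kCapacity];
    slot.threadId = threadId;
    slot.event = event;
    slot.sequence = seq;
}

#ifdef _WIN32
BOOL WINAPI DllMain(HINSTANCE /*instance*/, DWORD reason, LPVOID /*reserved*/) {
    switch (reason) {
    case DLL_THREAD_ATTACH:
        RecordThreadEvent(GetCurrentThreadId(), ThreadEvent::Attach);
        break;
    case DLL_THREAD_DETACH:
        RecordThreadEvent(GetCurrentThreadId(), ThreadEvent::Detach);
        break;
    default:
        break;  // process attach/detach are not recorded in this sketch
    }
    return TRUE;
}
#endif
```

After a hang, the idea would be to dump g_records from WinDbg and look for a thread id that has an Attach entry, no later Detach entry, and no longer appears in the live thread list. Two caveats: a thread killed with TerminateThread never delivers DLL_THREAD_DETACH, and Windows can reuse thread ids, so the sequence numbers matter when matching entries up.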