
I have a crash dump from a customer who is experiencing an issue that neither we nor they can reproduce, yet when they release their product to the end-user, it typically crashes. This has made it very difficult to decipher what's going on; all we have is the crash dump they sent us. I should say up front that I'm still learning my way around WinDbg and crash dump analysis in general. The application is a .NET app that interops with our unmanaged DLL. I don't see our module listed on any of the threads' call stacks at the time of the crash, so at first glance it appears not to be our fault. The customer also can't share the actual application, nor even a reasonable sample that mimics what they're doing, due to security restrictions.

But the end-user only recently started experiencing the issue, after they upgraded to a more recent release. So although that doesn't prove we're at fault, it seems highly likely that we are.

So I'm not expecting a magic answer to my problem, I'm more or less looking for techniques or an approach to root-causing such a crash using only the dump.

I suspect heap corruption: the actual corruption happens at some earlier point but doesn't bring down the process until much later. The call stack of the suspect thread doesn't give us much to go on; it looks like something is being freed that shouldn't be, and an Access Violation is reported.

One thing of note is that the exception context record (the .ecxr command) seems to be trashed: esp=0 and ebp=0. That makes me wonder whether I can gain anything of value from this crash dump at all, because in my experience up to this point I can usually get a valid call stack from .ecxr. Oddly, if I look at the call stack of the suspect thread directly, I do get a valid one. The debugger's heuristics (!analyze) don't give me much insight either, other than that some memory was freed that shouldn't have been.
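For context, here is roughly the set of WinDbg commands I've been working through so far, plus the heap commands I understand may still be able to extract something even with a trashed context record (the module name and address below are placeholders, not the real ones):

```
!analyze -v            $$ verbose heuristics: exception record, fault address
.ecxr                  $$ exception context record (trashed here: esp=0, ebp=0)
.exr -1                $$ raw exception record for the current event
~*kb                   $$ call stack of every thread in the dump
lmvm ourmodule         $$ version/timestamp of our DLL (placeholder name)
!heap -s               $$ summary of the process heaps
!heap -x <addr>        $$ find the heap block containing a given address
```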

One good idea was to have the customer enable Page Heap via GFlags.exe to catch the corruption at the moment it happens, but due to the customer's setup that probably won't be possible. So I have to assume this crash dump is all I'm ever going to get from them, and that I have to solve the issue with it alone.
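For completeness, in case the customer's situation changes, my understanding is that full page heap would be enabled roughly like this (MyApp.exe is a placeholder for their executable name):

```
gflags /p /enable MyApp.exe /full
    (full page heap: each allocation gets its own guard page, so an
     overrun raises an access violation at the faulting instruction)
gflags /p
    (list images that currently have page heap enabled)
gflags /p /disable MyApp.exe
    (turn it off again afterwards)
```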

I find myself spinning my wheels on this, and I'm thinking that if people could share stories of terribly difficult crash dump analyses, it might give me a new path to try. I can read some assembly, but it seems to me that experts in this area have many techniques up their sleeves before resorting to that, and I'm hoping they can share some with me.

JosephA
  • If this is a managed app, .ecxr and !analyze might not be helpful. If there's a managed exception you should be able to find it using the !threads command from SOS. – Brian Rasmussen Feb 27 '14 at 17:19
  • !pe reports either no managed exception or "current thread is unmanaged" for all of the threads. – JosephA Feb 27 '14 at 18:04
  • Sorry, I missed that you actually have an AccessViolation. That's usually tricky as it can be caused by lots of weird scenarios. Can you get anything useful from any of the native stacks? – Brian Rasmussen Feb 27 '14 at 18:07
  • Unfortunately the native stacks don't reveal the root cause; the corruption likely occurred long before the crash. – JosephA Feb 27 '14 at 20:22
  • I'm sorry I'm not able to provide more input here. Problems like this are hard to troubleshoot. Good luck! – Brian Rasmussen Feb 27 '14 at 20:35
  • Some questions: Your unmanaged DLL is used in a .NET environment, but all threads are unmanaged? If you say the .NET app *interops* your DLL, does it mean it is COM? Or is it native P/Invoke? Did you write instructions on how to use your DLL in .NET (Dispose pattern in COM or provide DllImport declaration for P/Invoke)? Do you have a sample .NET application which you can use to try things? Have you reviewed static code analysis output of your native (C++?) DLL? Have you run your DLL in a sample app with GFlags enabled? – Thomas Weller Feb 28 '14 at 03:15
  • COM Interop is being used by the customer, not P/Invoke. I have tested using our dll in .NET applications without being able to reproduce any issues. I've also attempted to use GFlags to try to track the problem down on our end but no luck. – JosephA Feb 28 '14 at 16:39
  • Why don't you provide at least the output from WinDbg with the call stack? – steve Mar 03 '14 at 00:58
  • We have found corruptions before by padding allocations with a known pattern, so we can walk backwards through the corruption until we see a pattern we recognize, identifying who likely wrote to the memory (incorrectly). I've also heard of people padding memory with an extra page at the start and end of the allocation and then un-mapping the virtual memory of those pages, so an overwrite causes an access violation that reveals the culprit. You could also check out Application Verifier, which can help in these cases. – Chris Mar 07 '14 at 05:06

0 Answers