Recently, we've had a 64-bit .NET 4.0 process with some unmanaged code crash by simply disappearing. There are no Event Viewer entries and no Windows error dialogs, and our current logging and trace statements don't indicate anything obvious. The code base is very large, so adding more trace statements would be time-consuming.

We have several third-party DLLs in use, but we have access to all the PDB files we need. The crash happens frequently throughout the day, but not at regular intervals. Our group suspects some mishandled multicast traffic might be the cause, but we're not 100% sure.

We've used ADPlus to debug the process in crash mode:

    adplus -crash -p <pid> -o c:\temp

and we've been getting some very strange behavior: the last minidump written when the crash occurs is for a first-chance "CONTRL_C_OR_Debug_Break" exception, and we are most certainly not pressing Ctrl+C. Every time we've attached the debugger, we've gotten this minidump anywhere from 10 minutes to 2 hours after launch. There are no second-chance exceptions, and no out-of-control memory or CPU usage.

I am admittedly a novice when it comes to CDB/ADPlus/WinDbg, but I know at least a few WinDbg/SOS commands to swim around a few crash dumps; on this minidump, I am stumped.

Am I going about diagnosing this problem the right way? What else can we do?

UPDATE

After getting the correct Windows Server 2008 symbol files, this appears to be the stack. What's the best way to hunt down possible heap corruption?

0:039> k
  *** Stack trace for last set context - .thread/.cxr resets it
Child-SP          RetAddr           Call Site
00000000`2d06f4f0 00000000`77834736 ntdll!RtlReportCriticalFailure+0x2f
00000000`2d06f5c0 00000000`77835942 ntdll!RtlpReportHeapFailure+0x26
00000000`2d06f5f0 00000000`778375f4 ntdll!RtlpHeapHandleError+0x12
00000000`2d06f620 00000000`777ddc8f ntdll!RtlpLogHeapFailure+0xa4
00000000`2d06f650 00000000`7767307a ntdll! ?? ::FNODOBFM::`string'+0x10c54
00000000`2d06f6d0 00000000`72a88cc4 kernel32!HeapFree+0xa
00000000`2d06f700 00000000`6ea37ffb msvcr100!free+0x1c
00000000`2d06f730 00000000`eb692d6c jvm+0x187ffb
00000000`2d06f738 00000000`2d06f7a8 0xeb692d6c
00000000`2d06f740 00000000`00000000 0x2d06f7a8

UPDATE 2

It turns out a combination of our app and a newer version of the JDK was indeed corrupting the heap. We caught the crash dump after enabling full page heap verification with gflags:

    gflags -p /enable MyProcess.exe /full

Still not sure exactly why, but downgrading our jvm actually fixed the problem for now. Big thanks to @MarcSherman and @SevaTitov for helping in comments.
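
Full page heap adds significant memory overhead to every allocation, so it is worth confirming that it is on and switching it off once the dump is captured. A minimal sketch, assuming the same image name as above and the `/p` spelling used in the comments below (`MyProcess.exe` stands in for the real executable):

    rem List the images that currently have page heap enabled
    gflags /p

    rem Turn page heap off again for this image
    gflags /p /disable MyProcess.exe

The setting is read when the process starts, so the process has to be restarted after each change.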

  • Have you tried DebugDiag http://www.microsoft.com/en-us/download/details.aspx?id=26798 – Naveen Apr 18 '13 at 19:27
  • I just did. It doesn't tell us much more than what we've gotten through normal WinDbg usage. – Steve MacCrory Apr 18 '13 at 20:20
  • Set up [Page Heap](http://msdn.microsoft.com/en-us/library/windows/hardware/ff549561(v=vs.85).aspx) for your executable, and run it under a debugger. This will get you a crash right at the instruction that causes the heap corruption. – seva titov Apr 19 '13 at 00:49
  • Normal page heap won't crash right at the instruction. You'll need full page heap for that. Additionally, you don't have to run it under the debugger. Post-mortem analysis with the resulting crash dump is sufficient. – Marc Sherman Apr 19 '13 at 14:03
  • Thank you, Marc and Seva. I just ran **gflags /p /enable /full** and am running the process now. We'll see how it goes. – Steve MacCrory Apr 19 '13 at 14:09
  • @MarcSherman thank you for your help! If you want to answer the question, go ahead. Turned on full heap verification, and it turns out the jvm itself was corrupting the heap. Still trying to find out why, but reverting to an older version of the jdk solves the problem for now. – Steve MacCrory Apr 22 '13 at 18:33

1 Answer

Here's what I did to find the root of the heap corruption:

  1. Installed Debugging Tools for Windows as a "Standalone" component.
  2. Enabled full heap verification with gflags:

    gflags -p /enable MyProcess.exe /full
    
  3. Caught the resulting crash dump with ADPlus:

    adplus.exe -crash -o <outputdirectory> -p <PID>
    
  4. Opened the resulting crash dump in WinDbg and ran:

    !analyze -v
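
Once the dump is open, a few more commands usually help narrow down who owns the bad block. This is only a sketch, assuming the dump was captured with full page heap enabled; `<address>` is a placeholder for the faulting heap address reported by `!analyze -v`, not a value taken from this crash:

    $$ Switch to the exception context stored in the dump, then show the stack
    .ecxr
    k

    $$ Show page heap details for the block, including its allocation stack trace
    !heap -p -a <address>

With full page heap, the faulting stack should point at the code that touched memory it did not own, rather than at a later `HeapFree` call that merely detected the damage.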
    

Thanks to @MarcSherman and @SevaTitov for their help in the comments.