
I have a fairly large server process, written for .NET 3.5 and running on a VM under VMware vCenter Server, that keeps crashing without any errors being reported. The process is created by a Windows Service on 32-bit Windows Server 2003, and is intended to be a long-running process (multiple days). It is a collaboration process that accepts connections via TCP sockets from multiple clients running on other Windows XP machines, and allows them to share data. In addition, the process self-hosts about 8 WCF services that expose a mixture of TCP and HTTP endpoints. The process generally consumes about 500 MB of memory and 30-50% CPU at all times. There is also an instance of SQL Server 2005 on the same VM that hosts 6 databases and consumes about 1-1.2 GB of memory. The entire system has been allocated 8 GB of RAM, and consumes as much as 7 GB during normal operation. I assume PAE is enabled to allow the system to address 8 GB of RAM, but have not confirmed this.

The problem is that, at seemingly random times, the process suddenly crashes with no errors being reported anywhere, including the event log. I've tried attaching debuggers to the process, and they have not caught the crash either. I first tried WinDbg on the release build with symbols loaded, then I replaced all of the release DLLs/EXEs with debug builds and loaded their symbols. The crashes still occurred, and the debugger did not catch them. I next installed Visual Studio on the system with the .NET Reflector add-in, and attached that. It also did not catch the crash.

Before you lecture me on why we're running so many things on a single VM, know that I did not design the system, nor did I implement it this way. Our customer dictated it for specific reasons, and I've been asked to come in and make it work. I'm only interested in criticisms of the environment if you can cite specific evidence that would help explain the sudden crashes. Our customer may be willing to alter the environment if we can show such evidence. Any additional debugging techniques that would let me capture more information about the crash would be greatly appreciated as well.

Todd
  • I would first add extensive logging (e.g. with NLog or log4net) to the application and see what gets logged when exceptions occur. – Uwe Keim Apr 17 '11 at 15:00
  • Are you using any unmanaged third-party libraries? When you attach WinDbg and the exception is not caught, is there anything else of interest in the WinDbg output? – wal Apr 17 '11 at 15:14
  • Hmm, my first guess (and that's *all* it is) is that it's some kind of out of memory error. Your process is probably eating up a lot of RAM, you're running into some kind of a problem with garbage collection, and the process is terminating. Not sure why you can't catch the error with a debugger. Check to make sure that you aren't leaking memory or handles. What does something like Process Monitor tell you while things are stable? – Cody Gray - on strike Apr 17 '11 at 15:18
  • We have extensive logging (NLog), and are not seeing any exceptions logged there when the process crashes. We are also using some unmanaged third-party libraries. Nothing of interest is in the debugger output. – Todd Apr 17 '11 at 15:39
  • Here is something interesting that we've been observing. Even though the process & parent service are running as another user account, on at least 3 occasions, we've observed that if I log into the server (to archive logs, restart services, etc.), when I log out the process has crashed at that same moment. It doesn't happen all of the time, but enough to make me think there is something going on. Again, the process is created by a windows service, and runs under its own dedicated user account (not mine). – Todd Apr 17 '11 at 16:25
  • More new information. Apparently, the process crashes every time I log out of the server. There are still no errors reported, but the NLog LogManager seems to reset or something, because just before the crash the log header is printed to the log file. What would cause the NLog LogManager to reset like this? – Todd Apr 17 '11 at 19:21
  • Is it the NLog LogManager resetting, or something upstream crashing that brings down everything (including NLog)? (rhetorical) – wal Apr 18 '11 at 11:34

3 Answers


A "crash" without output suggests a call to _exit() (or even exit()). I've seen a few corners of the Visual Studio runtime library do that, though they usually get a cryptic message out to stderr. Is stderr captured?

The suspicion of running out of memory also seems likely. If .net has a heapspace()-like function to describe how much memory is being used by the heap, log that periodically, perhaps along with total memory used (code + stack + data). I'm not familiar with .net, but there must be functions to get those values.
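For what it's worth, .NET does expose those values. Below is a minimal C# sketch of the periodic memory logging suggested above, assuming .NET 2.0 or later. `MemorySnapshot` and `Describe` are hypothetical names, not part of the original application. It only builds the log line; wiring it to a timer and to the existing NLog setup is left to the caller.

```csharp
using System;
using System.Diagnostics;
using System.Globalization;

// Hypothetical helper (an assumption for illustration, not the asker's code).
public static class MemorySnapshot
{
    // Builds one log line describing the current process's memory,
    // handle and thread usage.
    public static string Describe()
    {
        using (Process self = Process.GetCurrentProcess())
        {
            // Estimate of the managed heap size; 'false' means do not
            // force a garbage collection before measuring.
            long managedBytes = GC.GetTotalMemory(false);

            return string.Format(
                CultureInfo.InvariantCulture,
                "managed={0} private={1} workingSet={2} virtual={3} handles={4} threads={5}",
                managedBytes,
                self.PrivateMemorySize64,   // committed private memory, managed and unmanaged
                self.WorkingSet64,          // physical memory currently in use
                self.VirtualMemorySize64,   // reserved and committed address space
                self.HandleCount,           // open OS handles, useful for spotting leaks
                self.Threads.Count);
        }
    }
}
```

Calling `MemorySnapshot.Describe()` from a timer every minute or so and writing the result to the existing log would show whether any of these numbers climb before a crash. Comparing `managed` against `private` is a rough way to tell whether growth comes from managed code or from the unmanaged libraries, and `virtual` matters here because a 32-bit process has a limited user address space regardless of how much RAM the machine has.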

wallyk

It turns out that one of the service plugins was seeking out and referencing a Java library. When a user logged out of the server, the JVM was terminated, and that took the service process down with it. We were able to get everything working again by following the suggestions in this post (starting the JVM with the '-Xrs' option): http://www.velocityreviews.com/forums/t128371-java-app-dies-on-logoff.html
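For anyone applying the same fix, here is a rough C# sketch of where the flag goes, assuming the JVM is started as a child process through `java.exe`. That is only an assumption: this answer does not say how the plugin actually loads the JVM, and `JvmLauncher`, `BuildStartInfo`, and both parameters are hypothetical names. As far as I understand it, '-Xrs' reduces the JVM's use of operating-system signals, which on Windows keeps it from treating the logoff notification as a request to shut down.

```csharp
using System.Diagnostics;

// Hypothetical helper (an assumption for illustration, not the plugin's code).
public static class JvmLauncher
{
    // javaExe and jarPath are placeholders supplied by the caller.
    public static ProcessStartInfo BuildStartInfo(string javaExe, string jarPath)
    {
        ProcessStartInfo info = new ProcessStartInfo();
        info.FileName = javaExe;

        // JVM options such as -Xrs must come before -jar; anything after
        // the jar path is passed to the Java program as its own arguments.
        info.Arguments = "-Xrs -jar \"" + jarPath + "\"";

        // Start java.exe directly, without a shell or a console window.
        info.UseShellExecute = false;
        info.CreateNoWindow = true;
        return info;
    }
}
```

The returned object can be handed to `Process.Start`. If the JVM is loaded some other way, the same option has to be supplied through whatever mechanism that loader uses for JVM options.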

Todd