I have a mysterious crash that I'm struggling to locate in a large multi-threaded application compiled in MSVC 2005. The application is in daily use by a client, and any crashes cause significant disruption to them. I need a workaround. If I could isolate the issue to one function, and do something along these lines:

    __try
    {
      FunctionWhichMayCauseCrash();
    }
    __except ( [filter expression] )
    {
      Recover();  // magic - this allows us to prevent crash and continue
    }

then that would seem like a good idea to me in theory. In practice, some people (e.g. Larry Osterman here and Doug Harrison here) make it sound like it might be a very bad idea - that SEH should not be touched with a barge pole.
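As a rough illustration of what the `[filter expression]` placeholder could look like, the sketch below handles only access violations and lets every other structured exception keep propagating. `FilterDecision` and `GuardedCall` are hypothetical names, not part of any library. The numeric constants mirror the documented values of `EXCEPTION_ACCESS_VIOLATION`, `EXCEPTION_STACK_OVERFLOW`, `EXCEPTION_EXECUTE_HANDLER` and `EXCEPTION_CONTINUE_SEARCH`. The `__try`/`__except` part is guarded because it only compiles with an MSVC-compatible compiler.

```cpp
// Local names for the documented Windows values, chosen so they do not
// clash with the macros that <windows.h> defines.
const unsigned long kAccessViolation = 0xC0000005UL;  // EXCEPTION_ACCESS_VIOLATION
const unsigned long kStackOverflow   = 0xC00000FDUL;  // EXCEPTION_STACK_OVERFLOW
const int kExecuteHandler = 1;  // EXCEPTION_EXECUTE_HANDLER
const int kContinueSearch = 0;  // EXCEPTION_CONTINUE_SEARCH

// Hypothetical filter: run the handler for access violations only.
// Everything else (stack overflow, C++ exceptions, ...) is passed on.
int FilterDecision(unsigned long code)
{
    return code == kAccessViolation ? kExecuteHandler : kContinueSearch;
}

#ifdef _MSC_VER
#include <windows.h>

// Hypothetical wrapper: returns false if fn raised an access violation.
// The function deliberately holds no C++ objects with destructors,
// because MSVC rejects __try in functions that need object unwinding.
bool GuardedCall(void (*fn)())
{
    __try
    {
        fn();
        return true;
    }
    __except (FilterDecision(GetExceptionCode()))
    {
        return false;  // caller decides whether to log, recover, or exit
    }
}
#endif
```

A narrow filter like this at least avoids swallowing exceptions that the handler was never meant to see, which a bare `EXCEPTION_EXECUTE_HANDLER` would do. It says nothing about whether the process state is still trustworthy afterwards.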

Reality check: my program is generating structured exceptions, and I know not where. I am using parts of Hans Dietrich's XCrashReport - which itself uses __try/__except - to try to get insights into the source of these exceptions, but with no luck so far. It seems likely that some shared resource is not being properly locked, so that one thread is pulling the rug from beneath another thread, causing an access violation in a more or less random place.

Is there a pragmatic middle ground where such a mechanism could prevent my program from crashing? Should I be concerned that my crash recovery mechanism of choice uses something that others are wary of?

Clarification: because of the extreme disruption caused by program crashes, I seek a workaround that prevents crashes, NOT a final permanent solution. I have no intention of using __try/__except to sweep an issue under the carpet. I am merely trying to understand whether it is as dangerous as some people make it sound, or a legitimate tool that should be used with care. The way some people talk, the very moment I try compiling my code with /EHa defined, my computer will probably burst into flames. I am interested to know whether people would say using /EHa, _set_se_translator and try/catch(...) is better, or amounts to the same thing, or whether both are really bad ideas.
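For comparison, here is a minimal sketch of the `/EHa` plus `_set_se_translator` route, assuming the code is compiled with `/EHa`. `SeException`, `TranslateSe`, `InstallTranslatorForThisThread` and `CodeSeenByCatch` are hypothetical names introduced for illustration. The translator is installed per thread, so in a multi-threaded program each thread that needs it would have to call the install function itself.

```cpp
#include <stdexcept>

// Hypothetical C++ exception type that carries the structured exception code.
class SeException : public std::runtime_error
{
public:
    explicit SeException(unsigned int code)
        : std::runtime_error("structured exception"), code_(code) {}
    unsigned int code() const { return code_; }
private:
    unsigned int code_;
};

// Forward declaration only; the full type comes from <windows.h> on Windows.
struct _EXCEPTION_POINTERS;

// Translator with the signature _set_se_translator expects:
// it converts a structured exception into a typed C++ exception.
void TranslateSe(unsigned int code, _EXCEPTION_POINTERS*)
{
    throw SeException(code);
}

#ifdef _MSC_VER
#include <eh.h>

// Must be called on every thread that should translate structured exceptions.
void InstallTranslatorForThisThread()
{
    _set_se_translator(TranslateSe);
}
#endif

// Portable demonstration of what a catch site would observe: it calls the
// translator directly instead of waiting for a real hardware fault.
unsigned int CodeSeenByCatch(unsigned int code)
{
    try
    {
        TranslateSe(code, 0);
    }
    catch (const SeException& e)
    {
        return e.code();
    }
    return 0;
}
```

Catching `SeException` explicitly keeps structured exceptions distinguishable from ordinary C++ exceptions, which a plain `catch(...)` under `/EHa` does not. The underlying mechanism is still SEH, so the same concerns about corrupted state apply.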

Clarification 2: I don't need help debugging :-) Rather, I need help understanding the implications of mixing SEH and C++, something which seems to generate more heat than light on this and other forums. My low reputation indicates newness to the forum, not newness to C++. I deliberately abstracted my application out of the question to encourage people to focus on the implications of introducing SEH constructs to a C++ program. Well, that didn't work :-) As it happens, my application has a pipeline of objects, any of which I can readily dump if I detect corruption in them. So my magical Recover() function is not nearly as magical as it might sound, and there is a good chance that corruption will be limited to a small part of the heap. So... back to the question: is using __try/__except advisable?

omatai
  • possible duplicate of [Windows/C++: Is it possible to find the line of code where exception was thrown having "Exception Offset"](http://stackoverflow.com/questions/2528776/windows-c-is-it-possible-to-find-the-line-of-code-where-exception-was-thrown) – Hans Passant Feb 05 '13 at 03:38
  • @HansPassant I don't think it's a duplicate but your link is very useful. – xxbbcc Feb 05 '13 at 03:42
  • @Hans Not a duplicate - let me edit to add clarity... – omatai Feb 05 '13 at 03:44
  • It is the only sane way to tackle the problem. Catching SEH exceptions is pointless if you don't even know where to put the __try statement. The minidumps tell you *why* your code crashed. Now you can actually fix it instead of putting a band-aid on it that just makes the program misbehave even worse. – Hans Passant Feb 05 '13 at 04:25
  • @Hans - what is the "it" that is the only sane way to tackle the problem? So far my minidumps have told me nothing useful. The only thing consistent is the inconsistency - stack and/or heap corruption is implicated. However, there have been some clues about where to put __try statements, and is there any harm in putting them in? In fact, does it not help considerably to have my program limp along for even a millisecond longer so that it can log "exception caught in SomeFunction()", which would at least confirm something that I did not previously know? – omatai Feb 05 '13 at 04:32
  • @omatai If you let the program die after the logging, that's ok. That's not what you've been talking about so far. – xxbbcc Feb 05 '13 at 04:40
  • @xxbbcc - agreed on that point - it has to do with the solution, not the workaround. – omatai Feb 05 '13 at 04:54

2 Answers

Do. Not. Do. This.

I can fully agree with Doug Harrison's comments from your links - using SEH is very dangerous because you end up hiding (possibly severe) errors in your code.

If you have a very specific idea about where the exception may happen, temporarily adding a SEH block to your code may help in tracking it down but I suspect that's not the case - you have a corrupted stack.

I'd recommend against adding SEH blocks to a large part of your program because all it'll do is save the program from crashing at the cost of hiding those problems. You'll hide a crash but you won't know if your application's state has been corrupted (and to what extent) or not. Your client won't be much helped with this if corrupted data gets saved in the database.

Here's another SEH question, I think it may be useful to you.

Instead of trying to use SEH, use your time and energy to try to fix the problem. Using WinDbg (if you have a minidump from the crash) can speed things up. If you're not familiar with it, here's a tutorial.

I'm no SEH expert, so others may be able to give you more detailed advice, but I'd only try the SEH solution as a very, very last resort because of the possibility of even harder-to-find issues.

xxbbcc
  • So when you say "I'd recommend against adding SEH blocks to a large part of your program because all it'll do is save the program from crashing", are you not saying that doing this will do PRECISELY what I asked for? :) Let me emphasise again: crashes are EXTREMELY disruptive for my client, and I am seeking a pragmatic temporary workaround, not a permanent solution. I expect to be able to run a "normal" version of the program after hours so I can work on finding the cause of the issue. What I'm asking for here is how to make life bearable for my clients until I find the source of the bug. – omatai Feb 05 '13 at 03:41
  • Already doing that, but the minidump indicates the stack is corrupted and executing nonsense code :( – omatai Feb 05 '13 at 03:55
  • @omatai My guess is that you have heap corruption then. I'd **most definitely** not try to hide a corrupt stack. You can hardly do worse to your code's stability. – xxbbcc Feb 05 '13 at 03:57
  • +1 for reminder about WinDbg. For the rest, I'm not sure. My application has no database - there is little chance of any residual corruption because it is a real-time app in which all data, good or corrupted, is eventually flushed. Data is also processed in a pipeline fashion so that if corruption occurs to any given object, then a __try __except in an affected thread can conceivably recover and prevent other threads using that object. Does this sound so dangerous as to not even try it? Will it inherently introduce bugs? If so, via what mechanism? That's what I really want to understand :-) – omatai Feb 05 '13 at 05:37
  • @omatai The problem is that once the heap is corrupted, it's not any of your objects that matter - the heap manager in the CRT for example is not something you can easily flush. There are a number of other objects in the CRT that may get corrupted as well. Since you don't know what exactly causes the corruption, you really don't have a good place to start. For example, a CRT object may be around one of your pointers when memory gets overwritten - that CRT object is now gone. There's a chance that you can flush it out but you won't know if there's other damage. – xxbbcc Feb 05 '13 at 14:14
  • @omatai Just one more thing: you said that each crash is a significant disruption, so I assume the data you work with is important in some form even without a DB. In my opinion you'd run a fairly high risk of introducing corruption if you let the program live after a crash. Logging is different, of course. In the end, you have to decide whether to do this - I'm simply trying to point out that you look at SEH too optimistically. HTH. – xxbbcc Feb 05 '13 at 14:23

The problem is that by the time the SEH handler is invoked, your app is completely hosed. Any RAM that is accessible to (not just used by) FunctionWhichMayCauseCrash has to be assumed to be destroyed, including everything done in userspace by the CRT. The best thing to do really is to log everything you can get your hands on - in a way that depends solely on kernel functions - abort the entire process, close all of its IPC and shared handles, and start a shiny new process in its own address space.

If you really want fine-grained crash recovery in this situation, you'll likely need to re-architect to a series of pipe-connected processes or some such.
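The "abort and start a new process" idea could be sketched as a small supervisor that relaunches the worker until it exits cleanly or a restart budget runs out. `Supervise` and `LaunchAndWait` are hypothetical names, and the convention that exit code 0 means a clean shutdown is an assumption. The supervisor is a separate executable, so this sketch assumes a C++11 compiler for it rather than MSVC 2005; the Win32 launcher is guarded so the restart logic can be read and tried on its own.

```cpp
#include <functional>
#include <string>

// Hypothetical restart loop. launch_worker starts the worker process, waits
// for it, and returns its exit code. Returns how many launches were made.
int Supervise(const std::function<int()>& launch_worker, int max_launches)
{
    int launches = 0;
    while (launches < max_launches)
    {
        ++launches;
        if (launch_worker() == 0)  // assumption: 0 means clean shutdown
            break;
        // Non-zero exit: the worker died, so loop and start a fresh process.
    }
    return launches;
}

#ifdef _WIN32
#include <windows.h>

// Hypothetical launcher built on CreateProcessA: runs command_line in a new
// process (new address space) and returns its exit code, or -1 on failure.
int LaunchAndWait(const std::string& command_line)
{
    STARTUPINFOA si = {};
    si.cb = sizeof(si);
    PROCESS_INFORMATION pi = {};
    std::string cmd = command_line;  // CreateProcessA needs a writable buffer

    if (!CreateProcessA(NULL, &cmd[0], NULL, NULL, FALSE, 0, NULL, NULL, &si, &pi))
        return -1;

    WaitForSingleObject(pi.hProcess, INFINITE);
    DWORD exit_code = 1;
    GetExitCodeProcess(pi.hProcess, &exit_code);
    CloseHandle(pi.hThread);
    CloseHandle(pi.hProcess);
    return static_cast<int>(exit_code);
}
#endif
```

On Windows the two pieces might be combined as `Supervise([] { return LaunchAndWait("worker.exe"); }, 5);`, where `worker.exe` is a placeholder name. The restart budget keeps a worker that crashes on startup from being relaunched forever.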

David O'Riva
  • Is this not being really pessimistic? As it turns out, my application involves lots of objects in a pipeline, and the stack traces I'm getting indicate corruption to objects, but not much (if anything) else. I am able to dump the corrupted objects and continue along the pipeline. So is my app completely hosed only in practice, or only in theory? – omatai Feb 05 '13 at 04:08
  • @omatai I would say it's more realistic than pessimistic. If your crashing function is truly trivial, just reading the value of a variable and returning it for example, then you can certainly get away with a recovery-type action (if you know why it failed). With anything more complex you would need to properly understand the code the compiler has generated for your function and everything it calls. If you've lost sync with an object's reference count, or have loops that do things with referenced objects, you really could have trashed just about anything including your allocators. – David O'Riva Feb 05 '13 at 04:22