0

We have a custom application built around VB and C++ code. It can run for days, weeks, or months before it throws this exception.

I'm trying to learn more about how this error is thrown and how to interpret (if possible) the information reported with it. I've searched online and read Microsoft's description of the error, but I'm still stuck troubleshooting something that occurs once in a blue moon. No known sequence of interactions with the software causes it; it appears to happen randomly.

Is the first entry in the trace the root cause, or is the cause all the way down the call stack? Can anyone provide insight into how to read this trace so I can work out where I actually need to look?

Any information or guidance on reading the exception, making use of it, and then troubleshooting it would be helpful. The text below is copied from the Windows event log entry recorded when the exception was thrown.

Thanks in advance for any help.

Application: Epic.exe
Framework Version: v4.0.30319
Description: The process was terminated due to an unhandled exception.
Exception Info: System.AccessViolationException
at MemMap.ComBuf.IsCharAvailable(Int32) 
at HMI.frmPmacStat.RefreshTimer_Elapsed(System.Object, System.Timers.ElapsedEventArgs) 
at System.Timers.Timer.MyTimerCallback(System.Object) 
at System.Threading.TimerQueueTimer.CallCallbackInContext(System.Object) 
at System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
at System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)
at System.Threading.TimerQueueTimer.CallCallback() 
at System.Threading.TimerQueueTimer.Fire() 
at System.Threading.TimerQueue.FireQueuedTimerCompletion(System.Object) 
at System.Threading.QueueUserWorkItemCallback.System.Threading.IThreadPoolWorkItem.ExecuteWorkItem() 
at System.Threading.ThreadPoolWorkQueue.Dispatch() 
at System.Threading._ThreadPoolWaitCallback.PerformWaitCallback()


Ricky Layman
  • Do you have a crash dump that you can open in your IDE? It will show you the state of your program when it crashed. – Simple Sep 13 '17 at 14:20
  • I'm not sure... I'm not a computer programmer by trade, just an electrical engineer who has to dabble in it sometimes. I have attached the related Windows logs; the application runs from an .exe on a customer's machine in the field. – Ricky Layman Sep 13 '17 at 14:32
  • In general, this error is thrown any time the program tries to access an invalid memory address. Any number of classes of bugs might lead to it (e.g. buffer overrun, use after free). They generally involve some kind of undefined behavior, which is undefined partly because it is impractical for the compiler to detect such bugs. – Craig Sep 13 '17 at 14:34
  • Maybe this thread can help you: https://stackoverflow.com/questions/12228297/automatically-create-visual-c-crash-dump – 0xBADF00 Sep 13 '17 at 14:37
  • You are going to need an experienced programmer to get this problem resolved, preferably whoever worked on the project. Not just because an AVE is a very gritty mishap; this also looks like a threading bug. – Hans Passant Sep 13 '17 at 15:50

2 Answers

2

Some exceptions are thrown by the C++ runtime as the result of executing a throw expression; other errors arise when the operating system or hardware traps an instruction. An invalid memory access is generally not thrown by C++ code. It is a side effect of evaluating an expression that accesses memory at an invalid address, which makes the OS signal the process and usually kill it. Because this happens outside C++, the details are platform-specific, but typical causes are:

  • dereferencing a null pointer
  • using a pointer to an object that has been deleted
  • going outside an array's valid range of elements
  • using invalidated iterators into STL containers

Generally speaking, you can test for null pointers and array bounds at runtime to detect the problem before it happens. A dangling pointer is harder to track down, because a long time can pass between the delete and the misuse of the pointer, and the cause can be difficult to find without a memory debugger such as Valgrind. Using smart pointers instead of raw pointers helps avoid mismanaging memory and so helps avoid such errors.
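
As a minimal sketch of the runtime checks and smart pointers mentioned above (C++17): the `Device` type and all function names here are hypothetical and are not taken from the asker's code base.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <optional>
#include <vector>

// Hypothetical device record, used only for illustration.
struct Device {
    int status = 0;
};

// Returns the element at `index`, or std::nullopt when the index is out of
// range, instead of reading past the end of the buffer.
std::optional<int> checked_read(const std::vector<int>& buffer, std::size_t index) {
    if (index >= buffer.size()) {
        return std::nullopt;
    }
    return buffer[index];
}

// Returns the device status, or -1 when the pointer is null.
int status_or_default(const Device* device) {
    return device != nullptr ? device->status : -1;
}

// A std::weak_ptr lets the caller detect that the object is already gone:
// lock() yields an empty shared_ptr rather than a dangling pointer.
int status_if_alive(const std::weak_ptr<Device>& observer) {
    if (std::shared_ptr<Device> device = observer.lock()) {
        return device->status;
    }
    return -1;
}
```

Checks like these only catch the misuse at the point of access. They do not explain why an index or pointer went bad, so they are best combined with logging of the offending values.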

Invalidated iterators are a subset of the general dangling-pointer problem, but they are common enough to be worth their own category. Understanding your containers and which operations invalidate their iterators is crucial, and some standard library implementations can be compiled in a "debug mode" that helps detect the use of invalidated iterators.
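
To illustrate, here is a small sketch of two habits that avoid invalidated `std::vector` iterators; the function names are made up for this example. For the "debug mode" mentioned above, libstdc++ enables checked iterators with `-D_GLIBCXX_DEBUG`, and MSVC debug builds check iterators by default.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Removes every negative value with the erase-remove idiom, which never
// uses an iterator after the container has invalidated it.
void drop_negatives(std::vector<int>& values) {
    values.erase(std::remove_if(values.begin(), values.end(),
                                [](int v) { return v < 0; }),
                 values.end());
}

// Appends `extra` and returns the first element afterwards.
// push_back may reallocate the vector, which invalidates every iterator,
// pointer, and reference obtained earlier, so the element is looked up
// again after the insertion rather than through a saved iterator.
int append_then_first(std::vector<int>& values, int extra) {
    assert(!values.empty());
    values.push_back(extra);
    return *values.begin();  // fresh iterator, obtained after push_back
}
```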

Chris Uzdavinis
1

As others have noted, this type of error is tricky to identify without digging into the code and running tests (automated or manual). The more pieces of the system you can pull out and still reproduce it, the better. Divide and conquer is your friend here.

Beyond that, it all depends how important it is for you to resolve this and how much effort you're willing to put in. There are at least three classes of tools that can help with such intermittent problems:

  1. Application monitors that track potential errors as your application runs. These tend to slow your program significantly (10x or more slowdown). Examples include:
    1. Microsoft's Application Verifier
    2. Open-source and cross-platform Dr. Memory
    3. Google's Crashpad. Unlike the previous two programs, this one requires instrumenting your code. It is also (allegedly -- haven't tried it) easier to use with helpers like Backtrace's commercial integration for analyzing Crashpad output
    4. Google's Sanitizers - free and some are built into gcc and clang. There's also a Windows port of Address Sanitizer, but a cursory look suggests it may be a little bit of a second-class citizen.
    5. If you can also build it and reproduce the problem on Linux, you could use valgrind; rr (see this CppCast ep), a free tool that works with gdb to record and replay your program, so you can record a session that crashed and then step through it to see what went wrong; and/or UndoDB and friends from Undo software, a more sophisticated commercial product similar to rr.
  2. Static analysis of the code. This is a set of tools that looks for common errors in your code. It generally has a low signal-to-noise ratio, so there are a lot of minor things to dig through if you run it on a large, existing project (best to start a project using these tools from the beginning if possible). That said, many of the warnings are invaluable. Examples:
    1. Most compilers have a subset of this functionality built in. If you're using Visual Studio, add /W4 /WX to the compilation flags for the C++ code to enable a high warning level and treat warnings as errors, then fix all the warnings. For gcc and clang, add `-Wall -Wpedantic -Werror` for a similar effect.
    2. PVS-Studio (commercial)
    3. PC-Lint (commercial)
  3. If you can instrument the code to write log messages, something like Debugview++ may be of assistance.
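
For item 3, a minimal sketch of what such log instrumentation could look like. The helper names and the logged fields (`index`, `capacity`) are assumptions chosen for illustration. On Windows, the same string could also be passed to `OutputDebugStringA`, whose output tools like DebugView++ capture.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <mutex>
#include <ostream>
#include <sstream>
#include <string>
#include <thread>

// Hypothetical helper: builds one log line with a monotonic timestamp, the
// calling thread's id, and the values worth comparing between good and bad runs.
std::string format_log_line(const std::string& where, const std::string& what,
                            std::size_t index, std::size_t capacity) {
    std::ostringstream line;
    const auto now = std::chrono::steady_clock::now().time_since_epoch();
    line << std::chrono::duration_cast<std::chrono::milliseconds>(now).count()
         << "ms thread=" << std::this_thread::get_id()
         << " " << where << ": " << what
         << " index=" << index << " capacity=" << capacity;
    return line.str();
}

// Writes the line to `sink` under a lock so that lines from different threads
// do not interleave, and flushes so the last lines survive a crash.
void log_line(std::ostream& sink, const std::string& line) {
    static std::mutex sink_mutex;
    std::lock_guard<std::mutex> lock(sink_mutex);
    sink << line << '\n';
    sink.flush();
}
```

Logging the thread id is deliberate: if two different ids show up around the same buffer access shortly before a crash, that points toward a threading problem.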

Things get harder if you have multithreading going on, which it looks like you do: the non-determinism is harder to track, new classes of errors become possible, and some of the above tools won't work as well (e.g., rr runs all of a program's threads on a single core, so some races may not reproduce under it). Beyond a full IDE like Visual Studio, you'd need something like Intel's Inspector (formerly Intel Thread Checker) or, on Linux, Valgrind's Helgrind and DRD, or ThreadSanitizer (one of the sanitizers above, but also Linux only AFAIK). But hopefully this list gives you a place to start.
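
The stack trace in the question shows the timer callback running on a thread-pool thread, so any buffer it reads may be touched by another thread at the same time. As a hedged sketch of one common remedy, here is a buffer whose every access goes through a mutex. `GuardedBuffer` is a hypothetical class written for this example; nothing is known here about how the real `MemMap.ComBuf` is implemented.

```cpp
#include <cassert>
#include <deque>
#include <mutex>
#include <optional>

// Hypothetical character buffer shared between a producer and a timer callback.
class GuardedBuffer {
public:
    void push(char c) {
        std::lock_guard<std::mutex> lock(mutex_);
        data_.push_back(c);
    }

    bool is_char_available() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return !data_.empty();
    }

    // Checks and removes in one locked step, so no other thread can drain
    // the buffer between the check and the read.
    std::optional<char> try_pop() {
        std::lock_guard<std::mutex> lock(mutex_);
        if (data_.empty()) {
            return std::nullopt;
        }
        const char c = data_.front();
        data_.pop_front();
        return c;
    }

private:
    mutable std::mutex mutex_;  // mutable so const readers can lock it
    std::deque<char> data_;
};
```

Callers should prefer `try_pop()` over calling `is_char_available()` followed by a separate read, because the state can change between two separately locked calls.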

metal
  • I cannot run most of the tools in item 1 in practice. Much of this C++ code talks to real-world hardware devices worth hundreds of thousands of dollars in total. I may get access for a few days, or maybe a week if I'm lucky, but the error often goes months (6 months or more, pretty commonly) without occurring. I'm not sure whether these diagnostic tools would flush it out in a shorter time. Is there any indicator in those fault logs of where to look specifically? Or could the problem be in any of the listed locations? – Ricky Layman Sep 13 '17 at 21:46
  • The exception details are not particularly helpful in cases like this because a wayward pointer or threading bug can corrupt other parts of memory far afield from what the program is doing at the time. Many times, I've seen the program crash in some seemingly unrelated place because of a problem like yours. Possibly the best thing you can do in the circumstances you describe is instrument the code to log program state (tool 3) and then compare good and bad runs. This won't be fast or accurate, but it may be all you've got. – metal Sep 14 '17 at 15:00
  • Really, you and your bosses need to decide how important it is to fix. It is not likely going to be a simple fix, so count the cost. Are there other workarounds -- e.g. automatically restarting the process when it exits -- that would be acceptable? If not, what sort of investment will it take to fix? In theory, hardware can be spoofed in testing by dependency injection. Your biggest problem to overcome is the indeterminacy and intermittency. – metal Sep 14 '17 at 15:00
  • I think the only threading in the application is that generated by VS for its timers. Many of the forms are updated by timers to show changes in machine state on the fly; for example, the form mentioned in the crash has several timers running for different update loops. Similar to what you said, though, the form listed in the crash is not the form displayed when the system drops out. I did verify that the timer is stopped and disposed of in that form as well, so it may prove very hard to track, like you said. – Ricky Layman Sep 15 '17 at 13:21
  • That's good news on the threading. So as I see it, your non-exclusive options are: (1) run it offline in a test environment with spoofed hardware etc. to try to repro, (2) instrument the code to try to capture more details about what is going wrong and when in the production environment, and/or (3) do static analysis of the code to see if it turns up any real bugs. – metal Sep 15 '17 at 14:13
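
Option (1), spoofing the hardware through dependency injection, could look roughly like the sketch below. The interface and class names are invented for illustration; the real device protocol is not described in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical interface that hides the real hardware link from the code that uses it.
class IControllerLink {
public:
    virtual ~IControllerLink() = default;
    virtual bool is_char_available() = 0;
    virtual char read_char() = 0;
};

// Test double that replays a scripted response instead of talking to a device.
class FakeControllerLink : public IControllerLink {
public:
    explicit FakeControllerLink(std::string script) : script_(std::move(script)) {}

    bool is_char_available() override { return next_ < script_.size(); }

    char read_char() override {
        assert(next_ < script_.size());  // reading past the script is a test bug
        return script_[next_++];
    }

private:
    std::string script_;
    std::size_t next_ = 0;
};

// Code under test depends only on the interface, so it runs unchanged
// against the fake in a test environment or the real link in production.
std::string drain(IControllerLink& link) {
    std::string received;
    while (link.is_char_available()) {
        received.push_back(link.read_char());
    }
    return received;
}
```

With a fake like this, the polling code can be driven far harder and longer than the real machine allows, which may shorten the wait for a failure that otherwise takes months to appear.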