4

My program appears to be suffering from a very hard-to-reproduce bug: Once in a blue moon, when a user puts his Mac to sleep and later on wakes it back up again, one of my program's child processes will crash immediately after the Mac wakes up.

When this happens Apple's crash-reporter mechanism reliably reports a stack trace like this one:

Thread 0 Crashed:: Dispatch queue: com.apple.main-thread
0   libsystem_kernel.dylib          0x967f9a6a __pthread_kill + 10
1   libsystem_c.dylib               0x9003dacf pthread_kill + 101
2   libsystem_c.dylib               0x900744f8 abort + 168
3   com.meyersound.VirtualD-Mitri   0x0014438e muscle::CrashSignalHandler(int) + 190
4   libsystem_c.dylib               0x9002886b _sigtramp + 43
5   ???                             0xffffffff 0 + 4294967295
6   com.meyersound.VirtualD-Mitri   0x001442d0 muscle::ParsePortArg(muscle::Message const&, muscle::String const&, unsigned short&, unsigned long) + 80
7   com.meyersound.VirtualD-Mitri   0x005b3393 qnet::RepDBPeer::Pulse(muscle::PulseNode::PulseArgs const&) + 1187
8   com.meyersound.VirtualD-Mitri   0x0015717b muscle::PulseNode::PulseAux(unsigned long long) + 203
9   com.meyersound.VirtualD-Mitri   0x000cfb90 muscle::ReflectServer::ServerProcessLoop() + 3232
10  com.meyersound.VirtualD-Mitri   0x00607c7e dcasldmain(int, char**) + 2222
11  com.meyersound.VirtualD-Mitri   0x0072c14d dmitridmain(int, char**) + 4749
12  com.meyersound.VirtualD-Mitri   0x0000bc3a main + 4938
13  com.meyersound.VirtualD-Mitri   0x000061ab _start + 209
14  com.meyersound.VirtualD-Mitri   0x000060d9 start + 41

This is all well and good, except (cue eerie music) -- it's logically impossible. In particular, not only does my RepDBPeer::Pulse() method never call ParsePortArg, the crashing process never calls ParsePortArg anywhere! (I grepped all of my source code twice to make sure)

So my question is, just what is this stack trace trying to tell me? Is this most likely a case of Thread 0's stack getting corrupted badly enough that the stack-trace mechanism has gone off the rails and fingered an innocent bystander as the culprit? Or is it possible that Apple's wakeup mechanism somehow "jumped" the program counter into ParsePortArg() (whereupon the resulting confusion caused the crash)? Or is there some other deeper magic going on here that I can't even imagine?

The crashing process in question is a vanilla non-GUI background process that is a child process spawned by a Qt GUI process, for what that's worth.

Jeremy Friesner
  • 70,199
  • 15
  • 131
  • 234
  • 1
    I'm trying to think of situations where a named function can be called without looking like it. Macros and functionpointers immediately jump to mind. The function is obviously defined, can we see the other places where `ParsePortArg` is? – Mooing Duck Apr 26 '13 at 02:36
  • perhaps heavy optimizations. grep will not help if it is passed this function somehow in a variable. – Elazar Apr 26 '13 at 02:39
  • 1
    Exceptions push non-standard frames on the stack, and the unwind code does not always recognize them. – brian beuning Apr 26 '13 at 03:28
  • @Mooing Duck ParsePortArg() is only called once at the top of dccdmain(). dccdmain() is a function that dmitridmain() would call if argv contained a keyword telling it to do so, but in this case it never does (because it calls dcasldmain() instead). (dmitridmain() is a busybox-style delegation layer) – Jeremy Friesner Apr 26 '13 at 03:48
  • How many threads are in use in this program? Is there any chance that an object on the stack of one thread got trashed by a different thread? – jdkoftinoff Apr 26 '13 at 05:48
  • @jdkoftinoff there are 5-10 threads present in the process that crashes. Stack trashing by another thread is a possibility, although I'd be a little bit surprised if that turned out to be the problem, because the stack traces are always the same -- with a cause like that I'd expect more variation in the symptoms. – Jeremy Friesner Apr 26 '13 at 06:38

1 Answers1

1

I'm going to assume you have some amount of optimization turned on. There's no magic to stack traces. They become increasingly more fuzzy (read as "less accurate") once code is inlined or omitted, which is precisely what a C++ optimizer does. In the case of ParsePortArg, there is a +80 at the end of that line, meaning 80 bytes ahead of the entry point of that function in the code segment. This indicates the true address of the instruction pointer at 0x001442d0, and ParsePortArg is the nearest symbol that the stack dump guessed at. You were right to assume it was a red herring.

Barring any other data, I would make the very conservative guess that your program is expecting some pointer to remain valid that is not valid on wakeup from sleep. Look at the disassembly for the instruction at that address. I bet a memory address is trying to be read.

MattD
  • 380
  • 1
  • 15
  • Thanks, this was very helpful. Upon examining the disassembly (via "otool -tv MyApp.app/Contents/MacOS/MyApp" I can see that ParsePortArg() is adjacent to CrashSignalHandler() in the executable, so the offset must have fooled the stack-trace generator into printing ParsePortArg() when it was really the CrashSignalHandler() function on the stack. – Jeremy Friesner Apr 30 '13 at 17:41