Crash with all threads running SIGSEGV handler

Question

We develop a user-space process running on Linux 3.4.11 in an embedded MIPS system. The process creates multiple (>10) threads using pthreads. The process has a SIGSEGV signal handler which, among other things, generates a log message which goes to our log file. As part of this flow, it acquires a semaphore (bad, I know...).

During our testing the process appeared to hang. We're currently unable to build gdb for the target platform, so I wrote a CLI tool that uses ptrace to extract the register values and USER data using PTRACE_PEEKUSR.

What surprised me is that all of our threads were inside our crash handler, trying to acquire the semaphore. This (obviously?) indicates a deadlock on the semaphore, which means that a thread died while holding it. When I dug through the stacks, it seemed that almost all of the threads (except one) were in a blocking call (recv, poll, sleep) when the signal handler started running. Manual stack reconstruction on MIPS is a pain, so we have not fully done it yet. One thread appeared to be in the middle of a malloc call, which to me indicates that it crashed due to heap corruption.

A couple of things are still unclear:

1) Assuming one thread crashed in malloc, why would all other threads be running the SIGSEGV handler? As I understand it, a SIGSEGV signal is delivered to the faulting thread, no? Does it mean that each and every one of our threads crashed?

2) Looking at the sigcontext struct for MIPS, it seems it does not contain the memory address which was accessed (badaddr). Is there another place that has it? I couldn't find it anywhere, but it seemed odd to me that it would not be available.

And of course, if anyone can suggest ways to continue the analysis, it would be appreciated!

score 2 · Accepted Answer · answered Feb 27 '19 at 07:11

Yes, it is likely that all of your threads crashed in turn, assuming that you have captured the thread state correctly.

siginfo_t has a si_addr member, which should give you the address of the fault. Whether your kernel fills that in is a different matter.
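
As a rough illustration of the `si_addr` point: the handler only receives a `siginfo_t` when it is installed with `SA_SIGINFO`. The sketch below faults on purpose at a known address and checks what the kernel reports. The helper name `si_addr_matches_fault` is made up for this example, and recovering with `siglongjmp` is only there to keep the demo self-contained; it is not a suggestion for a production crash handler.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <setjmp.h>
#include <signal.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf recover_point;      /* where to resume after the fault */
static void *volatile reported_addr;  /* si_addr as seen by the handler */

/* SA_SIGINFO-style handler: record the fault address, then jump back out. */
static void on_segv(int signo, siginfo_t *info, void *ucontext)
{
    (void)signo;
    (void)ucontext;
    reported_addr = info->si_addr;
    siglongjmp(recover_point, 1);
}

/* Hypothetical demo helper: touches an inaccessible page and returns 1
 * if the kernel reported exactly that address in si_addr, else 0. */
int si_addr_matches_fault(void)
{
    struct sigaction sa, old;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_segv;
    sa.sa_flags = SA_SIGINFO;         /* required to receive siginfo_t */
    sigemptyset(&sa.sa_mask);
    int rc = sigaction(SIGSEGV, &sa, &old);
    assert(rc == 0);
    (void)rc;

    /* One page with no access rights: any read from it raises SIGSEGV. */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    char *guard = mmap(NULL, page, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    assert(guard != MAP_FAILED);
    volatile char *target = guard + 16;

    reported_addr = NULL;
    if (sigsetjmp(recover_point, 1) == 0) {
        char c = *target;             /* faults; control resumes via siglongjmp */
        (void)c;
    }

    int match = (reported_addr == (void *)(guard + 16));
    munmap(guard, page);
    sigaction(SIGSEGV, &old, NULL);   /* restore the previous handler */
    return match;
}
```

If `si_addr_matches_fault()` returns 0 on the target, that would suggest the kernel is not filling the field in for this kind of fault.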

In-process crash handlers will always be unreliable. You should use an out-of-process handler, and set kernel.core_pattern to invoke it. In current kernels, it is not necessary to write the core file to disk; you can either read the core file from standard input, or just map the process memory of the zombie process (which is still available when the kernel invokes the crash handler).
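
A minimal sketch of the reading side of such an out-of-process handler, assuming `kernel.core_pattern` is set to something like `|/path/to/handler %p` (the path is a placeholder; the leading `|` makes the kernel pipe the core image to the program's standard input, and `%p` expands to the PID). The function names are invented for this example, and the header check assumes the handler runs on the crashing machine, so `e_type` is in native byte order.

```c
#include <assert.h>
#include <elf.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Read exactly len bytes from fd (STDIN_FILENO when invoked through a
 * piped core_pattern). Returns 0 on success, -1 on EOF or error. */
int read_exact(int fd, unsigned char *buf, size_t len)
{
    size_t got = 0;
    while (got < len) {
        ssize_t n = read(fd, buf + got, len - got);
        if (n < 0 && errno == EINTR)
            continue;                 /* interrupted, try again */
        if (n <= 0)
            return -1;                /* EOF or real error */
        got += (size_t)n;
    }
    return 0;
}

/* Returns 1 if the buffer starts like an ELF core file, else 0.
 * e_type sits right after e_ident in both 32-bit and 64-bit headers. */
int looks_like_elf_core(const unsigned char *hdr, size_t len)
{
    uint16_t e_type;
    assert(hdr != NULL);
    if (len < EI_NIDENT + sizeof e_type)
        return 0;
    if (memcmp(hdr, ELFMAG, SELFMAG) != 0)
        return 0;
    memcpy(&e_type, hdr + EI_NIDENT, sizeof e_type);
    return e_type == ET_CORE;
}
```

A handler built on this would call `read_exact(STDIN_FILENO, ...)` for the ELF header first and only then decide how much of the rest to parse or log, instead of writing the whole core to disk.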

Florian Weimer

Is it not surprising that all threads crashed? Why wouldn't only a single offending thread get the SIGSEGV? – YSK Feb 27 '19 at 07:34
That entirely depends on the memory corruption. If shared data is corrupted, then it can happen easily that all threads crash in turn. – Florian Weimer Feb 27 '19 at 18:26
Guess you're right about the corruption. Following your `siginfo_t` tip, I searched the stack for `00 00 00 0B` (since SIGSEGV - signal 11) and found this shortly above the `$sp` of the crash: `00 00 00 0B | 00 00 00 00 | 00 00 00 00 | 00 00 10 88 | 00 00 00 00 | 00 00 00 00` - which is fascinating since the `00 00 10 88` is the PID of *another* thread in the process, not the one that crashed. Also, according to https://elixir.bootlin.com/linux/v3.4.11/source/arch/mips/include/asm/siginfo.h#L37, `siginfo_t` for SIGSEGV does not have a PID at all. So I'm confused... am I misinterpreting? – YSK Feb 27 '19 at 20:45
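
One possible way to read the words quoted in the comment above, sketched in C. It assumes the dump is big-endian and uses the 32-bit MIPS `siginfo_t` layout (`si_signo`, `si_code`, `si_errno`, then the union with no padding); since the second and third words are both zero here, the order of `si_code` and `si_errno` does not change the result. The struct and function names are made up for this example. Under those assumptions `si_code` is 0, which is `SI_USER` (a signal sent with `kill()`), and in that case the first union word is `si_pid` rather than `si_addr`. That is only one reading of the dump, not a confirmed diagnosis.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <stdint.h>

/* First four 32-bit words of a siginfo_t as found in a raw stack dump. */
struct siginfo_head {
    int32_t  si_signo;
    int32_t  si_code;
    int32_t  si_errno;
    uint32_t first_union_word;  /* si_pid or si_addr, depending on si_code */
};

/* How the first union word should be interpreted. */
enum union_kind { UNION_SENDER_PID, UNION_FAULT_ADDR, UNION_OTHER };

/* Assemble a 32-bit value from four big-endian bytes. */
static uint32_t be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Decode the head of a siginfo_t, assuming big-endian 32-bit MIPS layout. */
struct siginfo_head decode_mips32_siginfo(const unsigned char *bytes, size_t len)
{
    struct siginfo_head h;
    assert(bytes != NULL && len >= 16);
    (void)len;
    h.si_signo = (int32_t)be32(bytes);
    h.si_code  = (int32_t)be32(bytes + 4);
    h.si_errno = (int32_t)be32(bytes + 8);
    h.first_union_word = be32(bytes + 12);
    return h;
}

/* SI_USER: sent by kill(), the union starts with si_pid.
 * SEGV_MAPERR / SEGV_ACCERR: real memory fault, the union starts with si_addr. */
enum union_kind classify_segv_code(int32_t si_code)
{
    if (si_code == SI_USER)
        return UNION_SENDER_PID;
    if (si_code == SEGV_MAPERR || si_code == SEGV_ACCERR)
        return UNION_FAULT_ADDR;
    return UNION_OTHER;
}
```

Feeding the 16 bytes `00 00 00 0B 00 00 00 00 00 00 00 00 00 00 10 88` to `decode_mips32_siginfo` and passing the resulting `si_code` to `classify_segv_code` shows which interpretation applies to the fourth word.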
