
I'm looking at a core dump from an embedded MIPS Linux app. GDB reports SIGBUS, and the thread handling the signal appears to be sitting in a nanosleep syscall; the higher-level code basically called sleep(verylongtime). Assuming another process didn't send that signal to the app, what would cause this thread to be woken up like this? Has something inside the kernel generated the bus error? Could it have been caused by another thread that blocks such signals? (Please excuse any naivety here; I'm not too knowledgeable about signals.) Thanks.

gimmeamilk
  • Was there anything on dmesg? Were there any other threads running at the time? Was there a SIGBUS signal handler installed on that thread? – bdonlan Sep 26 '11 at 19:02
  • The dmesg log wasn't captured, unfortunately. About 30 threads. Originally there was no SIGBUS handler. I added one to diagnose this issue, and si_pid holds what appears to be an address in the .text of the program (!). Is there a scenario where that would happen? I did some experimentation and normally si_pid correctly holds the pid of the sending process, or zero if it was generated by this process. – gimmeamilk Sep 26 '11 at 20:29

1 Answer

If si_pid is set to an address, this means your SIGBUS was raised by a fault in the program. Usually this happens when the kernel tries to page in some program text, but encounters an IO error. Stack overflows can also trigger this.

You see si_pid set to an address because si_pid is part of a union, and is aliased with si_addr. In particular, si_pid is only meaningful when the signal was sent by a process (for example, si_code == SI_USER); for a fault, the kernel fills in si_addr instead. You may be able to get more information from the si_code member:

   The following values can be placed in si_code for a SIGBUS signal:

       BUS_ADRALN     invalid address alignment

       BUS_ADRERR     nonexistent physical address

       BUS_OBJERR     object-specific hardware error

       BUS_MCEERR_AR (since Linux 2.6.32)
                      Hardware memory error consumed on a machine check; action required.

       BUS_MCEERR_AO (since Linux 2.6.32)
                      Hardware memory error detected in process but not consumed; action optional.
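
A diagnostic handler along these lines could be sketched as follows. This is only a minimal illustration, assuming a Linux/glibc target; the names bus_code_name, put_str, put_uint, put_int, on_sigbus and install_sigbus_handler are made up for this example. It checks si_code first and only then reads whichever union member is valid, and it covers just the three classic BUS_* codes, not the BUS_MCEERR_* ones.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Map a SIGBUS fault code to its symbolic name (hypothetical helper). */
const char *bus_code_name(int code)
{
    switch (code) {
    case BUS_ADRALN: return "BUS_ADRALN";
    case BUS_ADRERR: return "BUS_ADRERR";
    case BUS_OBJERR: return "BUS_OBJERR";
    default:         return "other";
    }
}

/* The helpers below use write() only; printf() is not async-signal-safe. */
static void put_str(const char *s)
{
    ssize_t n = write(STDERR_FILENO, s, strlen(s));
    (void)n;
}

static void put_uint(uintptr_t v, unsigned base)
{
    char buf[3 * sizeof v];   /* enough digits for base 10 or 16 */
    size_t i = sizeof buf;
    do {
        buf[--i] = "0123456789abcdef"[v % base];
        v /= base;
    } while (v != 0);
    ssize_t n = write(STDERR_FILENO, buf + i, sizeof buf - i);
    (void)n;
}

static void put_int(long v)
{
    if (v < 0) {
        put_str("-");
        put_uint((uintptr_t)0 - (uintptr_t)v, 10);
    } else {
        put_uint((uintptr_t)v, 10);
    }
}

/* Decide from si_code which union member is valid before reading it. */
static void on_sigbus(int sig, siginfo_t *info, void *ucontext)
{
    (void)sig;
    (void)ucontext;
    int code = info->si_code;
    if (code == BUS_ADRALN || code == BUS_ADRERR || code == BUS_OBJERR) {
        put_str("SIGBUS fault ");
        put_str(bus_code_name(code));
        put_str(" at 0x");
        put_uint((uintptr_t)info->si_addr, 16);
    } else if (code == SI_USER || code == SI_QUEUE) {
        put_str("SIGBUS sent by pid ");
        put_int((long)info->si_pid);
    } else {
        put_str("SIGBUS with si_code ");
        put_int(code);
    }
    put_str("\n");
}

/* SA_RESETHAND restores the default action once the handler has run. */
int install_sigbus_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_sigbus;
    sa.sa_flags = SA_SIGINFO | SA_RESETHAND;
    sigemptyset(&sa.sa_mask);
    return sigaction(SIGBUS, &sa, NULL);
}
```

Call install_sigbus_handler() early in main(). Because of SA_RESETHAND, returning from the handler after a real fault should re-execute the faulting instruction with the default disposition restored, so the process still dies with SIGBUS and a core file can still be written.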

Note that it is not possible to block SIGBUS signals that the kernel raises for a fault: if you try to do so, your program will be terminated anyway.

I suspect your debugger may be a bit confused as to the origin of the SIGBUS signal here; it may be attributing it to the wrong thread. You may want to examine the other threads of your process to see if they're doing anything odd. Alternatively, you may have encountered an IO error when returning from the nanosleep and paging in the page of code at the return address.

bdonlan
  • Brilliant, thanks for this. I will re-examine the core file and report back. What kind of IO errors could occur? Note: this is an embedded linux environment with memory overcommit disabled. And do you mean stack overflows in user or kernel space? – gimmeamilk Sep 26 '11 at 21:19
  • Stack overflows in userspace. IO errors can mean just about anything that will trigger an `-EIO` on a `read()` – bdonlan Sep 26 '11 at 21:24
  • Recently I had an application that received SIGBUS because a hard disk failed and the kernel apparently couldn't swap in some of that process's memory regions. – thor Dec 19 '12 at 05:44