
I have a very strange crash on an ARM Linux platform caused by simple code. The first problem is that it reproduces rarely (about once a day); the second is that it crashes at a point where it seemingly cannot.

Let's start with the C++ code. The thread function does this:

    event_obj events[EVENTS_MAX]; // EVENTS_MAX = 32
    int num = 0;
    m_engine->getEvents(events, &num);

m_engine is a pointer to an abstract base class which has only one implementation at the moment. getEvents is a pure virtual method.

After some changes, getEvents does nothing but this:

    int engine::getEvents(event_obj*, int* num)
    {
        if (num != nullptr)
        {
            *num = 0; // SEGMENTATION FAULT
        }
        return 1; // ok
    }

The SEGFAULT happens when trying to store 0 through num. At first I thought it was stack corruption, but after checking the generated assembly code it seems that nothing is stored on the stack here. This method doesn't even have stack protection generated (-fstack-protector-strong is enabled); both parameters are passed in registers r1 and r2. Here is the code for the function call:

        event_obj events[EVENTS_MAX];
        int num = 0;
   236f8:       2300            movs    r3, #0
   236fa:       ac06            add     r4, sp, #24
   236fc:       9306            str     r3, [sp, #24]
        m_engine->getEvents(events, &num);
   236fe:       6803            ldr     r3, [r0, #0]
   23700:       691b            ldr     r3, [r3, #16]
   23702:       4622            mov     r2, r4
   23704:       a90c            add     r1, sp, #48     ; 0x30
   23706:       4798            blx     r3

and the code for the function itself:

int engine::getEvents(event_obj*, int* num)
{
    if (num != nullptr)
   251f8:       4613            mov     r3, r2
   251fa:       b10a            cbz     r2, 25200 <_Z18engine_thread_funcPv+0x9e0>
    {
        *num = 0;
   251fc:       2200            movs    r2, #0
   251fe:       601a            str     r2, [r3, #0]
    }
    return 1; // ok
}
   25200:       2001            movs    r0, #1
   25202:       4770            bx      lr
    return 1; // ok
}

As you can see from the generated code, the pointers are put into registers r1 and r2:

   23702:       4622            mov     r2, r4
   23704:       a90c            add     r1, sp, #48     ; 0x30

Even if the stack is corrupted, that could corrupt the value of the num variable, but how can it corrupt a pointer held in a register? Also, from the crash log I can see that the LR address is wrong:

CRASH signal 11 Segmentation fault address 0xf0000000 PC 0x251fe LR 0x6c3c533c
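
To make the "LR is wrong" observation concrete, here is a minimal sketch of the arithmetic. It assumes the call site is the 16-bit Thumb `blx r3` at 0x23706 shown above and that the target is little-endian; the helper names `expected_lr` and `as_ascii` are made up for illustration.

```cpp
#include <cstdint>
#include <string>

// A 16-bit Thumb `blx rN` at blx_addr leaves in LR the address of the next
// instruction (blx_addr + 2) with bit 0 set to mark Thumb state.
constexpr std::uint32_t expected_lr(std::uint32_t blx_addr) {
    return (blx_addr + 2u) | 1u;
}

// For the call site in the disassembly, LR should be 0x23709.
static_assert(expected_lr(0x23706) == 0x23709, "return address after blx r3");

// Bytes of a 32-bit word in little-endian memory order (assumption:
// little-endian target); non-printable bytes are shown as '.'.
std::string as_ascii(std::uint32_t word) {
    std::string out;
    for (int i = 0; i < 4; ++i) {
        const unsigned char b = static_cast<unsigned char>((word >> (8 * i)) & 0xFFu);
        out += (b >= 0x20 && b < 0x7F) ? static_cast<char>(b) : '.';
    }
    return out;
}

// as_ascii(0x6c3c533c) yields "<S<l": all four bytes of the logged LR
// are printable ASCII characters.
```

The logged LR (0x6c3c533c) is nowhere near 0x23709 and has bit 0 clear. Since getEvents is a leaf function that never writes LR, this suggests that control did not reach PC 0x251fe through the `blx r3` shown above. The fact that the value reads as text may hint at string data overwriting a saved context, but that is only a guess.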

The only thing I cannot see from here is the target address of the jump (blx r3), because the called method is virtual. I have one very unlikely hypothesis: instead of jumping to the first line of the virtual method body, it jumped to a few lines before it and corrupted the registers, but I don't see how that is possible. Also, it always crashes at the same line, even after changing the code. That is very strange.
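
For reference, the three instructions at the call site (`ldr r3, [r0, #0]`, `ldr r3, [r3, #16]`, `blx r3`) can be sketched as a hand-written virtual dispatch. This is an illustration only: `Object`, `Slot`, `get_events`, `unused_slot`, `kVtable` and `call_get_events` are hypothetical names, and the slot index 4 assumes 4-byte pointers, as on 32-bit ARM (byte offset 16 / 4).

```cpp
#include <cstddef>

struct Object;                                  // hypothetical stand-in for the engine object
using Slot = int (*)(Object*, void*, int*);     // self in r0, events in r1, num in r2

struct Object {
    const Slot* vptr;                           // first word of the object: pointer to the vtable
};

// Stand-in for engine::getEvents.
int get_events(Object*, void*, int* num) {
    if (num != nullptr) {
        *num = 0;
    }
    return 1;
}

// Placeholder for the other virtual functions in the table.
int unused_slot(Object*, void*, int*) { return -1; }

// With 4-byte pointers, byte offset 16 corresponds to slot index 4.
constexpr std::size_t kSlotIndex = 16 / 4;
const Slot kVtable[kSlotIndex + 1] = {
    unused_slot, unused_slot, unused_slot, unused_slot, get_events};

int call_get_events(Object* obj, void* events, int* num) {
    const Slot* table = obj->vptr;              // ldr r3, [r0, #0]
    const Slot target = table[kSlotIndex];      // ldr r3, [r3, #16]
    return target(obj, events, num);            // blx r3
}
```

The jump target depends on two memory loads: the vtable pointer stored inside the object and the slot inside the table. If either holds garbage (for example because the object was overwritten or destroyed), `blx r3` can land anywhere, including in the middle of a function with arbitrary register contents. So r2 itself does not have to be corrupted at the call site for the store to fault.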

Can someone suggest something to try? Any ideas?

Thanks in advance.

incognito
  • The only thing you can try is to produce a [mcve]. Without a [mcve] nobody on stackoverflow.com will be able to help you. Just because a program crashes at a particular point doesn't mean that's where the bug is. The bug can be anywhere. Welcome to C++. For example, if erroneous logic results in infinite recursion, this could be where the allowed stack space gets exceeded, resulting in a page fault due to an attempted write to an unmapped page. Just one example of many possibilities, and without a [mcve] nothing else can be said. – Sam Varshavchik Oct 19 '17 at 12:23
  • `<_Z18engine_thread_funcPv+0x9e0>` is also rather suspicious. The disassembler is putting in a rather strange address for what should be `engine::getEvents`. – MSalters Oct 19 '17 at 12:23
  • _The problem is that it reproduces rarely (once a day) and another problem is that it crashes where it actually cannot_ most likely UB somewhere. –  Oct 19 '17 at 12:27
  • The code you've described is not the problem. Odds are, some other code is exhibiting undefined behaviour, and that happens to do something which causes the code you've shown to break. As per Sam's comment, you need to focus on producing an [mcve]. In the process of doing that, you'll either have an "Aha!" moment and solve the problem yourself, or you'll wind up with a sample of code that someone else can work with which exhibits your problem. – Peter Oct 19 '17 at 12:34
  • I understand that the C++ code is fine here and this is obviously UB somewhere else. That is pretty much clear, but the code is big and I can't figure out what exactly I should look for to find the issue. That's why I am trying to analyze the generated assembly code in order to understand what could go wrong here, and I cannot see the reason. The crash is happening here, so UB somewhere else definitely affected this part of the code. – incognito Oct 19 '17 at 12:49
  • Finding a minimal verifiable example would take a few years, because each verification will take a day or two. – incognito Oct 19 '17 at 12:51
  • You can try running valgrind's memcheck. This will usually identify operations that access memory in some way that causes UB. – Knoep Oct 19 '17 at 12:59
  • @Knoep, thanks. That is what we are going to try, but performance on the device is too poor for valgrind; we tried once and it didn't work properly. Also, what makes me think it will not help is that memory is not used here at all. Both function parameters are put in registers, not on the stack (see the disassembly), and the pointer is corrupted in register r2. – incognito Oct 19 '17 at 13:01
  • It is hard to advise something useful without the full code. But you say that it reproduces rarely and it crashes where it actually cannot. This strange behavior can often be a symptom of problems with thread synchronization (maybe a race condition). If your program uses multithreading then I would advise looking at thread synchronization. – Sandro Oct 19 '17 at 13:13
  • @Sandro, the full code is a few hundred thousand lines; I can't put everything here and you wouldn't like to see it either. I thought about thread synchronization too, but each thread deals with its own stack and context switching saves all registers, so other threads shouldn't affect the behavior of this thread at all (at least not at this point). If you see how they can, that's a thing to discuss; let me know what kind of things could affect the registers of another thread and I will check if there is something like that in the code. I just can't think of any. – incognito Oct 19 '17 at 13:23
  • run it under valgrind if you can – pm100 Oct 19 '17 at 16:49

1 Answer


The fault occurs because engine is no longer valid. The method containing engine has probably been deallocated, i.e. your thread's memory is gone. As such, engine->getEvents is not even valid in memory. Something happened somewhere else in your code and the threads should have stopped running and exited. They haven't. This is much like a callback into an application that is exiting.

Jack
  • Thanks! This is something to check. I will try to check if this is really possible and add additional logging for object deallocation. Thanks for the idea. – incognito Oct 20 '17 at 06:50
  • I looked at the code and it seems that I have already checked that before. engine is created in the constructor before creating the thread and deleted right after pthread_join, so this thread cannot call getEvents on a deleted object. I am wondering whether the call to getEvents was made from another place. From the code I see that this is the only place where it is called, but what if due to UB it was called instead of another function? – incognito Oct 20 '17 at 10:27
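
The deallocation logging mentioned in the comments above could look roughly like the sketch below. `tracked_engine` is a hypothetical class, not part of the code base in the question. The idea is only to record construction and destruction together with the object address, so the log can be compared against the time and `this` pointer of a crash.

```cpp
#include <atomic>
#include <cstdio>

// Hypothetical lifetime tracker: logs every construction and destruction
// and keeps a count of live instances.
class tracked_engine {
public:
    tracked_engine() { log("ctor", ++live_); }
    ~tracked_engine() { log("dtor", --live_); }

    // Non-copyable, so every object has exactly one ctor/dtor log pair.
    tracked_engine(const tracked_engine&) = delete;
    tracked_engine& operator=(const tracked_engine&) = delete;

    // Number of currently live instances.
    static int live() { return live_.load(); }

private:
    void log(const char* what, int live_now) const {
        std::fprintf(stderr, "engine %s this=%p live=%d\n",
                     what, static_cast<const void*>(this), live_now);
    }

    static std::atomic<int> live_;
};

std::atomic<int> tracked_engine::live_{0};
```

If the real engine implementation logged like this, a "dtor" line appearing before the crash with the same address as the object used by the thread would support the use-after-free theory. If there is no such line, the object's lifetime is probably not the cause.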