Optimized GCC output for ARM A7 malfunctioning

Question

Attention ARM assembly gurus!

I have some code that surprisingly fails when optimized. I have it isolated in a simple unit test on my embedded device. The code is built for an ARM A7 embedded environment, using gcc-arm-8.3-2019.03-x86_64-arm-eabi.

The code is a basic binary tree search, as part of a template:

map_node_s<KEY, VALUE> * _getMutableNode(
    const KEY & key, 
    OUT map_node_s<KEY, VALUE> *** ptrToChildLink,
    OUT map_node_s<KEY, VALUE> ** ptrToParent
    )
{
    map_node_s<KEY, VALUE> * parent = NULL;
    map_node_s<KEY, VALUE> ** linkToChild = &_root;
    map_node_s<KEY, VALUE> * current = *linkToChild;

    while (current)
    {
        if (current->key == key)
        {
            break;
        }

        parent = current;
        if (key < current->key)
        {
            linkToChild = &parent->left;
            current = parent->left;
        }
        else
        {
            linkToChild = &parent->right;
            current = parent->right;
        }
    }

    *ptrToParent = parent;
    *ptrToChildLink = linkToChild;
    return current;
}

This is called by a method _getStoredKey and KEY and VALUE are both int.

The optimizer (O2) produces the following, which appears correct:

    map_node_s<KEY, VALUE> * _getStoredKey(const KEY & key)
63174910: e92d 43f7     stmdb sp!, {r0, r1, r2, r4, r5, r6, r7, r8, r9, lr}
63174914: 4604          mov   r4, r0
        map_node_s<KEY, VALUE> * current = *linkToChild;
63174916: 6843          ldr   r3, [r0, #4]
    map_node_s<KEY, VALUE> * _getStoredKey(const KEY & key)
63174918: 4689          mov   r9, r1
        map_node_s<KEY, VALUE> ** linkToChild = &_root;
6317491a: 1d07          adds    r7, r0, #4
        map_node_s<KEY, VALUE> * parent = NULL;
6317491c: f04f 0800     mov.w r8, #0
        while (current)
63174920: b17b          cbz   r3, 63174942 <_ZNSt3mapImSt4pairImSt6vectorIhN3csd9allocatorIhEEEEE13_getStoredKeyERKm+0x32>
            if (current->key == key)
63174922: 6819          ldr   r1, [r3, #0]
63174924: f8d9 2000     ldr.w r2, [r9]
63174928: 4291          cmp   r1, r2
6317492a: d00a          beq.n 63174942 <_ZNSt3mapImSt4pairImSt6vectorIhN3csd9allocatorIhEEEEE13_getStoredKeyERKm+0x32>
                current = parent->right;
6317492c: e9d3 2106     ldrd    r2, r1, [r3, #24]
                linkToChild = &parent->left;
63174930: bf8e          itee    hi
63174932: f103 0718     addhi.w   r7, r3, #24
                linkToChild = &parent->right;
63174936: f103 071c     addls.w   r7, r3, #28
                current = parent->right;
6317493a: 460a          movls r2, r1
6317493c: 4698          mov   r8, r3
6317493e: 4613          mov   r3, r2
63174940: e7ee          b.n   63174920 <_ZNSt3mapImSt4pairImSt6vectorIhN3csd9allocatorIhEEEEE13_getStoredKeyERKm+0x10>
        current = _getMutableNode(key, &linkToChild, &parent);
63174942: 9301          str   r3, [sp, #4]

But when run, the result is wrong sorting, unless I turn off optimization for the method!! The non-optimized assembly looks completely different.

The code will also work if I add a breakpoint. I don't know how that impacts the instruction pipeline, or I-cache, or something else.

Also, if I make a small adjustment, the code will work:

        if (key < current->key)
        {
            linkToChild = &parent->left;
        }
        else
        {
            linkToChild = &parent->right;
        }
        current = *linkToChild;

which emits

6317ff34:   b168        cbz r0, 6317ff52 <_ZNSt3mapIiiE15_getMutableNodeERKiPPPSt10map_node_sIiiES6_+0x28>
            if (current->key == key)
6317ff36:   680b        ldr r3, [r1, #0]
6317ff38:   6806        ldr r6, [r0, #0]
6317ff3a:   429e        cmp r6, r3
6317ff3c:   d009        beq.n   6317ff52 <_ZNSt3mapIiiE15_getMutableNodeERKiPPPSt10map_node_sIiiES6_+0x28>
                linkToChild = &parent->left;
6317ff3e:   bfc7        ittee   gt
6317ff40:   68c6        ldrgt   r6, [r0, #12]
6317ff42:   f100 040c   addgt.w r4, r0, #12
                linkToChild = &parent->right;
6317ff46:   6906        ldrle   r6, [r0, #16]
6317ff48:   f100 0410   addle.w r4, r0, #16
    map_node_s<KEY, VALUE> * _getMutableNode(
6317ff4c:   4605        mov r5, r0
6317ff4e:   4630        mov r0, r6
6317ff50:   e7f0        b.n 6317ff34 <_ZNSt3mapIiiE15_getMutableNodeERKiPPPSt10map_node_sIiiES6_+0xa>

Is there something wrong with the compiler-generated assembly?

Are there some additional asm directives I can insert to try and check for an execution issue?

What is the underlying type of `KEY`? The first piece of assembly appears to treat it as an unsigned type, while the second treats it as a signed type. — Michael, Oct 15 '19 at 15:53
In your O2 asm I can see `current = parent->right;` **twice**, and no `current = parent->left;` - you compile a different source code from what you've shown. — tum_, Oct 15 '19 at 16:00
@tum: AFAICT `ldrd r2, r1, [r3, #24]` sets `r2 = parent->left` and `r1 = parent->right`, and then the `movls r2, r1` is supposed to take care of changing `r2` to `parent->right` if `current->key <= key`. Except that `ls` might be the wrong condition if `KEY` is a signed type. — Michael, Oct 15 '19 at 16:08
@tum_, it is optimized, so you have to look very closely. addhi is dealing with left, and addls/movls is dealing with right. — jws, Oct 15 '19 at 16:09
It is a signed type, but gcc might know that all the values are unsigned. I will modify the test values to have some negatives and check. — jws, Oct 15 '19 at 16:11
For what it's worth, I tried all compilers back to 2017 and they all emit the same optimized code. — jws, Oct 15 '19 at 16:20
It fails at random number of operations, suggesting timing is involved. — jws, Oct 15 '19 at 16:24
Are you sure that the _only_ difference between the two assembly snippets you've shown was how you assigned `current`? I'm asking because the memory layout of the nodes appears to be different, and I don't see why that would be affected by that change. — Michael, Oct 15 '19 at 16:29
@jws: usually code that breaks only with optimization enabled contains UB. I'd suggest trying `-O2 -fsanitize=undefined -Wall`. Although the UB sanitizer might need somewhere to print; I don't know if your embedded test platform has a stderr. — Peter Cordes, Oct 15 '19 at 16:52
The image has a lot more code than is executed in this test, hence the address map shifts based on optimization. The template is used with several other KEY and VALUE types, so when I change optimization, it impacts about a dozen of instantiations. The image launches a simple shell, where I invoke only a unit test that only runs code above. Other things do exist in the background, such as IRQ handlers for I/O, the RTOS scheduler tick, etc. - I don't really have a way to isolate it further. — jws, Oct 15 '19 at 16:59
Looks like this is an RTOS bug (Zephyr). I caught register corruption from the scheduler. Geeze — jws, Oct 15 '19 at 17:35

Optimized GCC output for ARM A7 malfunctioning

0 Answers0