QEMU for AArch64: why does execution get stuck at "ldr q1, [x0]"?

Question

I have this simple C code:

#include "uart.h"
#include <string.h>
char x[32];
__attribute__((noinline))
void foo(void)
{
    strcpy(x, "xxxxxxxxxxxxxxxxxxxxxxxx");
}
int main(void)
{
  uart_puts("xxx\n");
  foo();
  uart_puts("yyy\n");
}

compiled as:

$ aarch64-none-elf-gcc t78.c -mcpu=cortex-a57 -Wall -Wextra -g -O2 -c -std=c11 \
&& aarch64-none-elf-ld -T linker.ld t78.o boot.o uart.o -o kernel.elf

and executed as:

$ qemu-system-aarch64.exe -machine virt -cpu cortex-a57 -nographic -kernel kernel.elf

prints:

xxx

Why is yyy not printed?

By reducing the issue I've found that:

- for strcpy, GCC generated inline code rather than a call to strcpy (see below);
- the instruction ldr q1, [x0] is what causes yyy not to be printed.

Here is the generated code of foo:

foo:
.LFB0:
        .file 1 "t78.c"
        .loc 1 6 1 view -0
        .cfi_startproc
        .loc 1 7 5 view .LVU1
        adrp    x0, .LC0
        add     x0, x0, :lo12:.LC0
        adrp    x1, .LANCHOR0
        add     x2, x1, :lo12:.LANCHOR0
        ldr     q1, [x0]                     // <<== root cause
        ldr     q0, [x0, 9]
        str     q1, [x1, #:lo12:.LANCHOR0]
        str     q0, [x2, 9]
        .loc 1 8 1 is_stmt 0 view .LVU2
        ret

If I put ret before ldr q1, [x0], then yyy is printed (as expected).

The question: why does ldr q1, [x0] cause yyy not to be printed?

Tool versions:

$ aarch64-none-elf-gcc --version
aarch64-none-elf-gcc.exe (Arm GNU Toolchain 12.2.Rel1 (Build arm-12.24)) 12.2.1 20221205

$ qemu-system-aarch64 --version
QEMU emulator version 7.2.0 (v7.2.0-11948-ge6523b71fc-dirty)

I suspect the problem is in the startup code or the linker script, or the uart code. The code in `foo` looks just fine. The compiler has optimized the `strcpy` call into two 16-byte SIMD load/stores, which should be legitimate. So we need a true [mcve]. — Nate Eldredge, Jan 30 '23 at 17:17
@old_timer: At least one of the load/stores is certainly unaligned, given the offset of 9. ARM64 supports that unless a flag is specifically set to make it trap, and the ABI requires that flag to not be set. Now, we don't know what the boot code may actually be doing to that flag, because it isn't shown in the question... — Nate Eldredge, Jan 30 '23 at 17:18
Linker issues are another common source of trouble in bare-metal code. For example, there could be something wrong with the placement or initialization of the `.bss` section; you wouldn't notice anything wrong until the first time your program uses a default-initialized global or `static` variable. Hence my request to see the linker script as well. — Nate Eldredge, Jan 30 '23 at 17:20
I'm also not sure, but there might be a bit that the boot code needs to set to enable the use of SIMD instructions. So, you would want to check whether that is being done. — Nate Eldredge, Jan 30 '23 at 17:22
Yes, this looks like the startup code is probably not enabling the FPU. — Peter Maydell, Jan 31 '23 at 11:20
@NateEldredge I took `linker.ld`, `boot.S`, and `uart.c` from [there](https://github.com/NienfengYao/armv8-bare-metal). Please have a look. — pmor, Jan 31 '23 at 13:56
That is indeed not enabling the FPU. (On real hardware you would probably need to do more things also, but QEMU is a bit more lenient...) — Peter Maydell, Jan 31 '23 at 14:51
@PeterMaydell OK. I enabled the FPU using `mrs x1, cpacr_el1 \n mov x0, #(3 << 20) \n orr x0, x1, x0 \n msr cpacr_el1, x0`. Alternatively, the FPU can be kept disabled and GCC can be asked (via `-mgeneral-regs-only`) to generate code that uses only the general-purpose registers. Can you post an answer? Extra question: does hardware with the FPU enabled but unused consume more energy than hardware with the FPU disabled? If so, how much more? 0.1%? 1%? 5%? — pmor, Feb 01 '23 at 14:52
The question about power consumption could only be answered by your chip vendor, or by testing. — Nate Eldredge, Feb 01 '23 at 15:18
Glad you found something that worked. I've written up an answer which includes a few more things that might be important for startup code. I agree with Nate that power consumption is entirely a matter for the hardware vendor. More generally, if you care about power consumption this is a complex topic and getting good power consumption goes much further than just "don't turn on the FPU". — Peter Maydell, Feb 01 '23 at 15:46

score 1 · Accepted Answer · answered Feb 01 '23 at 15:43

The ldr q1, [x0] instruction is taking an exception because it accesses a floating-point/SIMD register but your startup code does not enable the FPU. The compiler is assuming that it can generate code that uses the FPU, so to meet that assumption one of the things your startup code must do is enable the FPU, via at least CPACR_EL1, and possibly other registers if EL2 or EL3 are enabled.
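
As a rough sketch of what "enable the FPU via CPACR_EL1" amounts to, written in C: FPEN is the two-bit field at bits [21:20] of CPACR_EL1, and setting both bits stops FP/SIMD instructions from trapping at EL0 and EL1. The helper names below are invented for illustration. The inline-assembly wrapper is only compiled for AArch64 targets and assumes it is called at EL1 from code that does not itself touch FP/SIMD registers; it does not cover any EL2/EL3 controls.

```c
#include <stdint.h>

/* CPACR_EL1.FPEN is bits [21:20]; the value 0b11 means FP/SIMD accesses
   are not trapped at EL0 or EL1. */
#define CPACR_EL1_FPEN_SHIFT 20
#define CPACR_EL1_FPEN_MASK  (UINT64_C(3) << CPACR_EL1_FPEN_SHIFT)

/* Hypothetical helper: return the given CPACR_EL1 value with FPEN = 0b11,
   leaving every other bit unchanged. */
static inline uint64_t cpacr_with_fp_enabled(uint64_t cpacr)
{
    return cpacr | CPACR_EL1_FPEN_MASK;
}

#if defined(__aarch64__)
/* Hypothetical wrapper (assumption: executed at EL1 during startup). */
static inline void enable_fp_simd_el1(void)
{
    uint64_t cpacr;
    __asm__ volatile("mrs %0, cpacr_el1" : "=r"(cpacr));
    cpacr = cpacr_with_fp_enabled(cpacr);
    /* The ISB makes the new setting visible to the instructions that follow. */
    __asm__ volatile("msr cpacr_el1, %0\n\tisb" : : "r"(cpacr) : "memory");
}
#endif
```

The read-modify-write keeps whatever else is already set in CPACR_EL1, which is the same idea as the mrs/orr/msr sequence mentioned in the comments.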

Alternatively, you could tell the compiler not to emit code that uses the FPU. The Linux kernel takes this approach, using the -mgeneral-regs-only option.

Real hardware probably has more strict requirements for what you need to do to configure the CPU to be able to run C code; QEMU is quite lenient. For instance the architecture defines that the reset value of many system registers is UNKNOWN, though QEMU usually resets them to zero. A robust startup sequence will explicitly set bits in registers like SCTLR_EL1.

You may also need to watch out for whether your compiler and your startup code agree about whether the compiler generated code is allowed to emit unaligned accesses -- if the MMU is not enabled then all memory accesses are treated as of type Device, which means they must be aligned (regardless of SCTLR_EL1.A). So you either need to make sure your compiler doesn't try to emit unaligned loads and stores, or else turn on the MMU and set SCTLR_EL1.A to 0.
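
To see which accesses in foo this would affect, here is a small sketch that checks natural alignment for a given access size. It assumes the base address of the buffer is 16-byte aligned, which the question does not state, and the helper name is made up for illustration. On the compiler side, GCC's AArch64 `-mstrict-align` option is the usual way to ask it not to rely on unaligned accesses being handled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true if an access of `size` bytes at `addr` is
   naturally aligned. `size` must be a power of two. */
static inline bool is_naturally_aligned(uintptr_t addr, size_t size)
{
    return (addr & (size - 1)) == 0;
}

/* Size and offsets of the two Q-register loads/stores in foo. */
enum { Q_ACCESS_SIZE = 16, FIRST_OFFSET = 0, SECOND_OFFSET = 9 };
```

With an aligned base, the access at offset 0 passes the check and the one at offset 9 does not, so the second load/store pair is the one that depends on unaligned accesses being permitted.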

You could improve your ability to debug this sort of "exception in early bootup" by installing some exception vectors which do something helpful when an unexpected exception occurs. The ideal is to be able to print registers, especially ELR_EL1 and ESR_EL1, which tell you where and why the exception occurred; printing in early bootup can be tricky, though. An easy compromise is to at least catch the exception and loop; you can then use gdb to see what the CPU state is.
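
Once ESR_EL1 has been captured (by a handler or by reading it in gdb), decoding it is mostly a matter of pulling out the exception class. The following is a minimal sketch with invented helper names; it only lists a handful of EC values, and the descriptions are shortened. For the fault in the question, the expected class would be 0x07, a trapped FP/SIMD access.

```c
#include <stdint.h>

/* ESR_ELx layout: EC (exception class) is bits [31:26], ISS is bits [24:0]. */
static inline unsigned esr_ec(uint64_t esr)
{
    return (unsigned)((esr >> 26) & 0x3f);
}

static inline uint32_t esr_iss(uint64_t esr)
{
    return (uint32_t)(esr & 0x1ffffff);
}

/* Hypothetical helper: map a few common EC values to short descriptions. */
static inline const char *esr_ec_name(unsigned ec)
{
    switch (ec) {
    case 0x00: return "unknown reason";
    case 0x07: return "trapped FP/SIMD access";
    case 0x15: return "SVC from AArch64";
    case 0x21: return "instruction abort, same EL";
    case 0x25: return "data abort, same EL";
    case 0x26: return "SP alignment fault";
    default:   return "other";
    }
}
```

A catch-and-loop handler could store ESR_EL1 and ELR_EL1 in memory before spinning, so the values can be inspected and decoded this way from the debugger.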

score 0 · Answer 2 · answered Feb 02 '23 at 11:53

An addition to the answer by Peter Maydell.

Here is the code that enables the FPU (found here):

mrs    x1, cpacr_el1
mov    x0, #(3 << 20)
orr    x0, x1, x0
msr    cpacr_el1, x0

pmor

Note that `3 << 20` is a valid bitmask immediate, so the middle two instructions can be replaced with simply `orr x0, x1, #(3 << 20)`. – Nate Eldredge Feb 04 '23 at 07:51
