
On the GNU website there is a simple example that is supposed to demonstrate the problems that arise with non-atomic access. The example contains a small mistake: it is missing #include <unistd.h>:

#include <signal.h>
#include <stdio.h>
#include <unistd.h>

struct two_words { int a, b; } memory;

static struct two_words zeros = { 0, 0 }, ones = { 1, 1 };

void handler(int signum)
{
   printf ("%d,%d\n", memory.a, memory.b);
   alarm (1);
}

int main (void)
{
   signal (SIGALRM, handler);
   memory = zeros;
   alarm (1);
   while (1)
     {
       memory = zeros;
       memory = ones;
     }
}

The idea is that the assignment memory = zeros; or memory = ones; takes multiple cycles, so at some point the interrupt handler will catch the struct half-written and print "0,1" or "1,0".

However, interestingly, for the x86-64 architecture the assembly code produced by the GCC compiler looks as follows. It appears that the whole assignment is done by a single movq instruction:

    .file   "interrupt_handler.c"
    .text
    .comm   memory,8,8
    .local  zeros
    .comm   zeros,8,8
    .data
    .align 8
    .type   ones, @object
    .size   ones, 8
ones:
    .long   1
    .long   1
    .section    .rodata
.LC0:
    .string "%d,%d\n"
    .text
    .globl  handler
    .type   handler, @function
handler:
.LFB0:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    subq    $16, %rsp
    movl    %edi, -4(%rbp)
    movl    4+memory(%rip), %edx
    movl    memory(%rip), %eax
    movl    %eax, %esi
    leaq    .LC0(%rip), %rdi
    movl    $0, %eax
    call    printf@PLT
    movl    $1, %edi
    call    alarm@PLT
    nop
    leave
    .cfi_def_cfa 7, 8
    ret
    .cfi_endproc
.LFE0:
    .size   handler, .-handler
    .globl  main
    .type   main, @function
main:
.LFB1:
    .cfi_startproc
    pushq   %rbp
    .cfi_def_cfa_offset 16
    .cfi_offset 6, -16
    movq    %rsp, %rbp
    .cfi_def_cfa_register 6
    leaq    handler(%rip), %rsi
    movl    $14, %edi
    call    signal@PLT
    movq    zeros(%rip), %rax
    movq    %rax, memory(%rip)
    movl    $1, %edi
    call    alarm@PLT
.L3:
    movq    zeros(%rip), %rax
    movq    %rax, memory(%rip)
    movq    ones(%rip), %rax
    movq    %rax, memory(%rip)
    jmp .L3
    .cfi_endproc
.LFE1:
    .size   main, .-main
    .ident  "GCC: (Ubuntu 7.3.0-16ubuntu3) 7.3.0"
    .section    .note.GNU-stack,"",@progbits

How is it possible that two separate assignments are carried out by a single instruction? I would have thought that assigning two different ints has to write to two different pieces of memory, but here both seem to be written in one go.

The example changes when I use double instead of int. In that case, the while loop in the assembly becomes:

.L3:
    movq    zeros(%rip), %rax
    movq    8+zeros(%rip), %rdx
    movq    %rax, memory(%rip)
    movq    %rdx, 8+memory(%rip)
    movq    ones(%rip), %rax
    movq    8+ones(%rip), %rdx
    movq    %rax, memory(%rip)
    movq    %rdx, 8+memory(%rip)
    jmp .L3
MasterMind
    Just because it happened to work on one machine doesn't mean it would always work... here you clearly had "luck": gcc saw that your structure fits in a register and can be set in one 64-bit operation. This does not mean it would always be the case on every arch or for every struct. – OznOg Oct 17 '18 at 16:35
    This example was written a long time ago (in the history of computers), when machines with 64-bit load and store were much less common than they are today. Fatih, would you mind filing a bug report on the glibc documentation (at https://sourceware.org/bugzilla/ ) so we remember to correct it? – zwol Oct 17 '18 at 16:43
    By the way `movq` is only required to be atomic when it does not cross a cache line boundary, so just because it is used does not prove that the assignment is atomic now – harold Oct 17 '18 at 17:16
  • Notice that in `handler`, the two halves are loaded separately. If another thread was writing the struct between reads, you could get tearing, but not from a signal handler in the same thread. Also note that this code depends on being compiled without optimization, making every variable effectively `volatile`. If you compile it with optimization, you get an empty infinite loop. https://godbolt.org/z/d6V42Z. Anyway, compile with `-m32 -mno-sse` to stop the compiler from using 64-bit stores for the assignment. – Peter Cordes Oct 17 '18 at 17:40
  • Thanks for all the commentary. My question was more about how it is possible that two ints can be loaded and stored within one instruction on the x86-64 architecture. It is indeed true that if you compile this for other (older) machines the bug/error will still occur. Also, zwol suggested filing a bug report. I would be happy to do this, but I don't understand what it is for: is the glibc documentation related to this GNU example somehow? – MasterMind Oct 17 '18 at 21:49
    The two ints are not just randomly in unrelated different memory locations, they are adjacent, which makes it possible to treat both of them together as a single entity in some sense. Similarly you could `memcpy` both of them together with a size of 8 which you may be more familiar with – harold Oct 18 '18 at 10:03
    @Fatih You didn't say where on the "website of GNU" (by which I assume you mean https://www.gnu.org) you found this code, but I recognize it as the example code from . That page is part of the manual for the GNU C Library, "glibc" for short, and problems with that manual should be filed in the bug tracker I linked to earlier. (I am an occasional contributor to glibc.) – zwol Oct 18 '18 at 14:13

1 Answer


As pointed out in the comments, this is an old example. Back then, a typical machine had a register width of 32 bits or less; nowadays, machines with at least 64-bit registers are the norm. Two 4-byte ints total 8 bytes, which fits exactly in a 64-bit register. Note also that the two ints are not stored at unrelated memory locations: they are adjacent to each other. This is why GCC can treat them as a single entity and copy both with one 64-bit operation. It is an optimization the compiler is allowed, but not required, to make.

Support for 32-byte SIMD vectors (AVX) is widespread in x86 CPUs. Therefore, even with doubles, at certain optimization settings it is possible to process both doubles in a single instruction. You can try it on Godbolt and see that GCC and clang optimize away the non-atomic, non-volatile stores inside the loop unless you reduce the optimization level to -Og.

However, if you want to be sure that you get the described tearing behavior no matter the optimization level, make the struct volatile and pad the members apart, like volatile struct two_words { int a; _Alignas(64) int b; } memory;

MasterMind
  • "instruction length" normally refers to the machine code bytes that encode an instruction. What you're talking about is the *register width* or *operand size* that makes it easy to store or copy 64 bits at once, either with pure integer operations or with a SIMD vector. e.g. x86-64 `add eax, [rdi]` operates on 32-bit data, but the machine-code `03 07` (hex) is 2 bytes long. – Peter Cordes May 28 '23 at 12:19
  • One `double` is 8 bytes on all mainstream C implementations, in IEEE binary64 format. Support for 32-byte SIMD vectors is widespread in x86 CPUs (AVX), but a lot of programs aren't compiled to use it because there are also lots of x86 CPUs without it (e.g. Pentium / Celeron CPUs as new as Skylake, and low-power Silvermont-family.) Also, 32-byte *atomicity* isn't guaranteed across threads (https://rigtorp.se/isatomic/), but as long as its a single instruction it'll be atomic wrt. signals (interrupts). But support for 16-byte vectors *is* widespread across x86 and ARM. – Peter Cordes May 28 '23 at 12:23
  • https://godbolt.org/z/7WdhezPGj shows GCC and clang optimizing away the non-atomic non-volatile stores inside the loop unless you reduce optimization level to -Og. Then x86-64 GCC does both stores, each one 16 bytes at a time with aligned stores, which *is* guaranteed atomic on Intel CPUs that support AVX. Only at `-O0` do they do each `double` struct member separately, with 64-bit integer operations, making tearing possible. https://godbolt.org/z/z34PEPYvf – Peter Cordes May 28 '23 at 12:29
  • Also corrected for the second comment, thank you for the input. The third comment, and thus the last remark, is a matter of optimization settings: technically the compiler can optimize, and at some settings it does. If you want to make sure this optimization cannot happen at all, no matter the settings, the solution is to pick a double so that it does not fit within a register. – MasterMind May 28 '23 at 13:02
  • Like my 3rd comment showed, 128-bit register width for SIMD registers is baseline for x86-64. That code was using `struct { double a,b; };`. There's nothing stopping compilers from using it unless you did `-mno-sse` or something. `-O0` happens not to with GCC or clang, but that's a choice. To make it actually impossible, pad them like `struct { int a; _Alignas(64) int b; };` so the whole struct is 128 **bytes** in size, and the two values are 64 bytes apart. Current CPUs have at the widest 512-bit (64-byte) vector registers, and unlike ARM, x86 doesn't have load-pair / store-pair insns. – Peter Cordes May 28 '23 at 13:11
  • I see now, thank you for the comment. I will use it in the answer. – MasterMind May 28 '23 at 13:18
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/253864/discussion-between-mastermind-and-peter-cordes). – MasterMind May 28 '23 at 13:29