
I wrote this microbenchmark to better understand Go's performance characteristics, so that I can make informed choices about when to use it.

I thought this would be the ideal scenario for Go from a performance-overhead point of view:

  • no allocations / deallocations inside the loop
  • array access clearly within bounds (bounds checks could be removed)

Still, I'm seeing an exactly 4-fold difference in speed relative to gcc -O3 on AMD64. Why is that?

(Timed using the shell. Each takes a few seconds, so the startup is negligible)

package main

import "fmt"

func main() {
    fmt.Println("started")

    var n int32 = 1024 * 32

    a := make([]int32, n, n)
    b := make([]int32, n, n)

    var it, i, j int32

    for i = 0; i < n; i++ {
        a[i] =  i
        b[i] = -i
    }

    var r int32 = 10
    var sum int32 = 0

    for it = 0; it < r; it++ {
        for i = 0; i < n; i++ {
            for j = 0; j < n; j++ {
                sum += (a[i] + b[j]) * (it + 1)
            }
        }
    }
    fmt.Printf("n = %d, r = %d, sum = %d\n", n, r, sum)
}

The C version:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h> /* for int32_t */

int main() {
    printf("started\n");

    int32_t n = 1024 * 32;

    int32_t* a = malloc(sizeof(int32_t) * n);
    int32_t* b = malloc(sizeof(int32_t) * n);

    for(int32_t i = 0; i < n; ++i) {
        a[i] =  i;
        b[i] = -i;
    }

    int32_t r = 10;
    int32_t sum = 0;

    for(int32_t it = 0; it < r; ++it) {
        for(int32_t i = 0; i < n; ++i) {
            for(int32_t j = 0; j < n; ++j) {
                sum += (a[i] + b[j]) * (it + 1);
            }
        }
    }
    printf("n = %d, r = %d, sum = %d\n", n, r, sum);

    free(a);
    free(b);
}

Updates:

  • Using range, as suggested, speeds Go up by a factor of 2 (see the sketch after this list).
  • On the other hand, -march=native speeds C up by a factor of 2, in my tests. (And -mno-sse gives a compile error, apparently incompatible with -O3)
  • gccgo seems comparable to GCC here (and does not need `range`)
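
For concreteness, the range change amounts to rewriting the timed loop roughly like this (a sketch against the program above, not measured code; ranging over `b` is what lets the compiler drop the bounds check in the innermost loop, and the now-unused `j` has to be dropped from the `var` declaration):

for it = 0; it < r; it++ {
    for i = 0; i < n; i++ {
        for _, bj := range b { // range over b: no per-element bounds check
            sum += (a[i] + bj) * (it + 1)
        }
    }
}
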
MWB
  • Not seeing any timer code. Are you timing from the shell? – stark Feb 28 '23 at 20:41
  • Looking at the [codegen](https://gcc.godbolt.org/z/WrqcEx4TK), it seems that Go did not remove the bound checking. Try to move `a[i]` one level higher to see if it makes any difference. – Ecir Hana Feb 28 '23 at 20:47
  • Make a Go benchmark so you can see actual allocations and amortized time (see the sketch after these comments). – JimB Feb 28 '23 at 20:47
  • re: EcirHana's point, use `range` to reliably elide bounds checking – JimB Feb 28 '23 at 20:48
  • Don't use the shell for timing. Go runtime startup is longer. Measure the for-loop. – Burak Serdar Feb 28 '23 at 20:59
  • @BurakSerdar on my machine, the startup is 7000x faster (which I can measure by changing `n` and `r` to tiny values) – MWB Feb 28 '23 at 21:03
  • @JimB Thanks, this sped it up by a factor of 2. – MWB Feb 28 '23 at 21:07
  • You have lots of noise. For one, Go runtime starts with at least three goroutines. Go is statically linked, C is not. Measure the for-loop if you want to compare the performance of the for-loops. – Burak Serdar Feb 28 '23 at 21:14
  • I saw some attempt at SIMD in the C version [as compiled by GCC -O3](https://godbolt.org/z/offEc9WWG) (though it doesn't look ideal to me), the [Go version](https://godbolt.org/z/3e3Ecrzs9) by comparison looks like nonsense, as if a 30 year old compiler was used, is there some compilation flag that I was supposed to use? – harold Feb 28 '23 at 21:20
  • @harold You can *disable* optimizations in Go: https://stackoverflow.com/questions/45003259/passing-an-optimization-flag-to-a-go-compiler – MWB Feb 28 '23 at 21:38
  • @BurakSerdar The noise comes from the OS, other processes, etc. Eliminating 0.003s from a 14.000s runtime does not eliminate the noise. – MWB Mar 01 '23 at 01:20
  • @harold I'm hoping someone who understands assembly will explain what GCC does that Go does not, in an accessible manner. (Side note: given Google's investment in hardware, it's weird that they didn't build Go on top of, say, LLVM, enabling its optimizations as an option) – MWB Mar 01 '23 at 03:17
  • @MWB Google never intended Go to be comparable to C++ for performance. It's meant to be a more appealing alternative to Python, for anything where performance matters at all they still use C++. – user229044 Mar 01 '23 at 03:37
  • I'm curious what error message you get with `-mno-sse`. It works for me on x86-64 Arch GNU/Linux, and https://godbolt.org/z/4EEPWsjG8. There's no `float` or `double`, and no header that defines inline functions using those types; that would make it impossible for the compiler to follow the calling convention. If you're on a non-x86 system then of course `-mno-sse` isn't a valid option, but you say you're on AMD64. Not that I'd recommend `-mno-sse` for this; just use `-fno-tree-vectorize` to only prevent vectorization. – Peter Cordes Mar 01 '23 at 10:02
  • @PeterCordes I'm using 10.2 (AMD64 Debian Stable). I guess this got fixed, whatever it was. The error message points to `x86_64-linux-gnu/bits/stdlib-float.h:26` saying *"error: SSE register return with SSE disabled"* – MWB Mar 01 '23 at 20:16
  • @MWB: Interesting. The definition of `__extern_inline double __NTH (atof (const char *__nptr))` is still there in the header on my system, but GCC12 doesn't complain as long as it's not called. https://godbolt.org/z/den4MGf81 confirms that GCC10 and earlier error on that `atof` definition with `-mno-sse`, but GCC11 and later don't. The preprocessed source on godbolt is the same, so it's a compiler difference, not a header difference. GCC11 might have been when GCC changed to not emit a definition for `extern inline` C functions, so you need to instantiate them in exactly one `.c` file. – Peter Cordes Mar 01 '23 at 20:34
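
Following JimB's suggestion, a benchmark around the loop could look roughly like the sketch below. The `kernel` helper and the Benchmark name are assumptions, not part of the original programs; it would live in a `_test.go` file next to the Go program above and run with `go test -bench=.`, which measures only the loop rather than process startup.

package main

import "testing"

// kernel is the triple loop from the question, factored out so that only the
// loop itself is timed.
func kernel(a, b []int32, r int32) int32 {
    var sum int32
    for it := int32(0); it < r; it++ {
        for i := range a {
            for j := range b {
                sum += (a[i] + b[j]) * (it + 1)
            }
        }
    }
    return sum
}

func BenchmarkKernel(bench *testing.B) {
    n := int32(1024 * 32)
    a := make([]int32, n)
    b := make([]int32, n)
    for i := int32(0); i < n; i++ {
        a[i] = i
        b[i] = -i
    }
    bench.ReportAllocs() // shows whether the loop allocates at all
    var sum int32
    for k := 0; k < bench.N; k++ {
        sum = kernel(a, b, 1) // one repetition per iteration; bench.N handles the repeats
    }
    _ = sum
}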

1 Answer


Looking at the assembler output of the C program vs the Go program, at least on the versions of Go and GCC that I use (1.19.6 and 12.2.0, respectively), the immediate and obvious difference is that GCC has auto-vectorized the C program, whereas the Go compiler does not seem to have been capable of that.

That also explains fairly well why you're seeing exactly a four-fold increase in performance, since GCC, when not targeting a specific architecture, uses SSE rather than AVX, meaning four times the width of scalar instructions for 32-bit operations. In fact, adding -march=native adds another two-fold performance increase for me, since that makes GCC output AVX code on my CPU.

I'm not intimate enough with Go to be able to tell you whether the Go compiler is intrinsically incapable of autovectorization or if it's just this particular program that trips it up for some reason, but nevertheless that seems to be the fundamental reason.
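
Short of vectorizing, the loop can also be tightened by hand along the lines of Ecir Hana's comment: hoist `a[i]` (and `it + 1`) out of the inner loop and range over the slices so the bounds checks go away. A sketch against the question's program (with the now-unused `i` and `j` dropped from the `var` declaration); this only trims scalar work per iteration, it is not vectorization:

for it = 0; it < r; it++ {
    k := it + 1
    for _, ai := range a { // a[i] hoisted one level up, per the comment
        for _, bj := range b {
            sum += (ai + bj) * k
        }
    }
}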

Dolda2000
  • *"four-fold"* -- as I mentioned in the comments, this becomes two-fold when `range` is used, eliminating the bounds checks, supposedly. – MWB Mar 01 '23 at 03:43
  • Well, in fact I have to retract my comments about `-march=native` using AVX instructions, too. I hadn't actually looked at the assembly output of that compilation when I wrote it, but looking at it, it turns out to in fact still use SSE, so the additional improvements must be attributable solely to better instruction scheduling. Nevertheless, the C program is autovectorized, and the Go program is not, it simply seems to be a difference in the competence of the respective compilers. – Dolda2000 Mar 01 '23 at 03:46
  • By the way, one thing you can try to verify on your own is to pass `-mno-sse` to GCC. That will keep general optimizations on high while not using any vector instructions. – Dolda2000 Mar 01 '23 at 03:52
  • @Dolda2000: `-march=native` should use AVX instructions if your CPU supports them. It might not use 256-bit vector width, depending on tuning options for `-mtune=native`, e.g. Bulldozer-family and Zen1 default to `-mprefer-vector-width=128`. https://godbolt.org/z/YjfajxY8a . But you can still see it using `vpmulld`, the AVX1 form of the SSE4.1 instruction. Even if your CPU (or VM) doesn't support AVX, but does support SSE4.1, `pmulld` to vectorize the 32x32 => 32-bit multiplies will be an improvement over shuffling around `vpmuludq` which does packed 32x32 => 64-bit multiplies. – Peter Cordes Mar 01 '23 at 05:53
  • If you want to disable auto-vectorization but not block other stuff that uses vector regs (like scalar FP math or efficient struct / local array init), use `gcc -O3 -fno-tree-vectorize` or `clang -O3 -fno-vectorize`. – Peter Cordes Mar 01 '23 at 05:54
  • BTW, Intel CPUs from Haswell onward run `pmulld` as two dependent uops, presumably doing odd/even elements using the 52-bit mantissa multipliers per 64-bit element, since the 32-bit float mantissa multipliers aren't wide enough. `pmuludq` is only one uop. However, AMD CPUs run `pmulld` as a single uop, with Zen3 and later having it fully pipelined with the same throughput as `pmuludq`, so SSE4.1 or AVX1 could explain a 2x if SIMD integer multiply execution units were the bottleneck both times. (Zen1/2 have lower tput; the 1 uop occupies the port for longer) https://uops.info/ – Peter Cordes Mar 01 '23 at 05:58