
I copied glibc's implementation of the binary search algorithm, then modified it a little to suit my needs. I decided to test it together with other things I have learned about GCC (attributes and built-ins). The code looks like this:

int main() {
  uint_fast16_t a[61] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61 };
  uint64_t t1 = Time(0);
  for(register uint_fast16_t i = 0; i < 10000000; ++i) {
    binary_search(rand() % 62, a, 61);
  }
  printf("%ld\n", Time(0) - t1);
  return 0;
}

Now, this program runs just fine. The problem begins when I add more lines of code, for instance:

uint_fast16_t a[61] __attribute__ ((aligned (64) )) = /* ... */

In this case I would expect faster code, yet performance did not change across multiple tests (tens of runs). I also tested the program with alignments of 8 and 1 - no change. I even expected GCC to emit an error or warning, since I was requesting an alignment smaller than the type's size (on my 64-bit machine, uint_fast16_t is 8 bytes), but there was none. Then I made another change: adding prefetching via __builtin_prefetch. I added the following code before the for loop:

caches(a, uint_fast16_t, uint_fast16_t, 61, 0, 3);
// where "caches" is:
#define caches(x, type, data_type, size, rw, l) ({ \
  for(type Q2W0 = 0; Q2W0 < size; Q2W0 += 64 / sizeof(data_type)) { \
    __builtin_prefetch(x + Q2W0, rw, l); \
  } \
})
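For reference, here is a minimal sketch (my expansion, not code from the original post) of what that macro call reduces to on a target where `sizeof(uint_fast16_t)` is 8, so the stride is 64 / 8 = 8 elements per cache line. The function returns the number of prefetches issued, to make the stride easy to check:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: with data_type = uint_fast16_t at 8 bytes,
 * 64 / sizeof(data_type) == 8, so the call
 * caches(a, uint_fast16_t, uint_fast16_t, 61, 0, 3)
 * expands to roughly this loop - one prefetch per 64-byte cache line. */
static size_t prefetch_array(const uint_fast16_t *a)
{
    size_t issued = 0;
    for (uint_fast16_t i = 0; i < 61; i += 8) {
        __builtin_prefetch(a + i, 0, 3); /* rw = 0 (read), locality = 3 */
        ++issued;
    }
    return issued;
}
```

For a 61-element array that is 8 prefetches, covering elements 0 through 60.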

No change in performance there either. I figured maybe my CPU was caching the array automatically after the first binary_search, so I eliminated the for loop and measured a few more times with and without the prefetching line, but again I did not notice any change in performance.
More information:

  1. Using CentOS8 64bit latest kernel
  2. Using GCC 9.2.1 20191120
  3. Compiling with -O3 -Wall -pthread -lanl -Wno-missing-braces -Wmissing-field-initializers, no errors / warnings during compilation
  4. Things are not optimised away (checked asm output)

I am pretty sure I don't know about something / I am doing something wrong.

Full code available here.

  • It's almost like your changes didn't improve performance. Are you sure they did? – user253751 Nov 26 '20 at 14:36
  • Just because you specify a minimum alignment, that doesn't mean that an object *has* to be misaligned otherwise. You have a number of micro-optimizations in your `main`, but are `binary_search()` and `rand()` so fast that the micro-optimizations matter? You should take a look at the generated assembly. – Thomas Jager Nov 26 '20 at 14:37
  • Well that's the whole point. Why didn't they change anything at all? I mean, I did things both to increase the performance and decrease it (increasing alignment and decreasing it, caching and no caching). Maybe the changes I am making really do not matter and that's the reason. – Franciszek Balcerak Nov 26 '20 at 14:38
  • Also, your code is broken. `10000000` does not fit in 16 bits, so the loop only ever ends because `uint_fast16_t` is bigger than `uint16_t`. – Thomas Jager Nov 26 '20 at 14:38
  • No. I already said that I am running the program on a 64-bit machine. uint_fast16_t is the fastest type with a minimum of 16 bits, and that is 64 bits on a 64-bit machine. – Franciszek Balcerak Nov 26 '20 at 14:40
  • @FranciszekBalcerak That's bad coding. Just because you know it happens to work on your machine doesn't mean you should abuse that. If you need a variable that can count up to `10000000`, use a type you *know* to be at least 32 bits wide. This could be `uint_fast32_t` if you want. That being said, again, these sorts of optimizations aren't useful if what you call within the loop is orders of magnitude slower than what's calling it. – Thomas Jager Nov 26 '20 at 14:43
  • Bad is assuming that I am coding for everyone. This code is only supposed to be used by me. I know that not everyone has the same machine as I do. – Franciszek Balcerak Nov 26 '20 at 14:46
  • @FranciszekBalcerak Except the person who is you is unlikely to remember all the subtle, hard-coded crap you sneaked into the code when you re-visit it one year from now :) – Lundin Nov 26 '20 at 14:54

1 Answer

  • register uint_fast16_t is premature optimization; let the compiler decide which variables to place in registers. Regard register as a mostly obsolete keyword.

  • As noted in comments, uint_fast16_t i = 0; i < 10000000 is either a bug or bad practice. You should perhaps do something like this instead:

    const uint_fast16_t MAX = 10000000; 
    ... i < MAX
    

    In which case you should get compiler errors upon initialization, if the value does not fit. Alternatively, check the value with static assertions.

    Better yet, use size_t for the loop iterator in this case.
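A static assertion along those lines might look like this (a sketch using C11's `static_assert` from `<assert.h>`; `MAX` stands in for the loop bound from the question):

```c
#include <assert.h>   /* C11 static_assert macro */
#include <stdint.h>   /* UINT_FAST16_MAX */

#define MAX 10000000

/* Fails to compile on any target where uint_fast16_t cannot hold MAX,
 * instead of silently relying on the type being widened to 64 bits. */
static_assert(MAX <= UINT_FAST16_MAX,
              "loop bound does not fit in uint_fast16_t");
```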

  • __attribute__ ((aligned (64) )) "In this case I would expect faster code"

    Why? What makes you think the array was misaligned to begin with? The compiler will not misalign variables just for the sake of it. Particularly not when the array members are declared as a uint_fastN_t type - the whole point of using uint_fast16_t is in fact to get suitable alignment.

    In this case, the array causes both gcc and clang for x86-64 to emit a bunch of .quad assembler directives, resulting in perfectly aligned data.
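One way to see this for yourself (a quick sketch, not part of the answer's original text) is to compute the array's offset within a cache line at run time, with and without the attribute:

```c
#include <stddef.h>
#include <stdint.h>

/* The attribute is the GCC extension from the question; drop it and the
 * array will still typically land on at least an 8-byte boundary. */
static uint_fast16_t a[61] __attribute__((aligned(64)));

/* Returns the array's misalignment relative to a 64-byte cache line;
 * 0 means it starts exactly on a line boundary. */
static size_t cache_line_offset(void)
{
    return (size_t)((uintptr_t)a % 64);
}
```

With the attribute present, `cache_line_offset()` is guaranteed to be 0; without it, it is merely very likely to be a multiple of `_Alignof(uint_fast16_t)`.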

  • Regarding the cache commands, I know too little of how they work to comment on them. It is however likely that you already have ideal data cache performance in this case - the array should be in data cache.

    As for instruction cache, it's unlikely to do much good during binary search, which by its nature comes with a tonne of branches. In some cases a brute force linear search might outperform binary search for this very reason. Benchmark and see. (And make sure to bludgeon your old computer science algorithm teacher with a big O when brute force proves to be much faster than binary search.)
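As a hedged illustration of the brute-force alternative (the `binary_search(key, array, size)` signature is assumed from the question, since the real implementation is not shown), a branch-light linear search over 61 elements might look like:

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch: linear search with no data-dependent branch in the loop body,
 * so the branch predictor only has to handle the loop itself.
 * Returns 1 if key is present, 0 otherwise. */
static int linear_search(uint_fast16_t key, const uint_fast16_t *a, size_t n)
{
    int found = 0;
    for (size_t i = 0; i < n; ++i)
        found |= (a[i] == key);
    return found;
}
```

For an array this small the predictable loop may well beat the unpredictable branches of a binary search; only a benchmark on the target machine can say.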

  • rand() % 62 may or may not be quite a bottleneck - both the rand function and the modulo operation can mean a lot of overhead, depending on the system.
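One way to take that overhead out of the measurement (a sketch; `fill_keys` is a hypothetical helper, not from the question) is to precompute the random keys before starting the timer:

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Fill keys[0..n) with the same rand() % 62 values the question's loop
 * would have drawn, so the timed loop runs only the search itself. */
static void fill_keys(uint_fast16_t *keys, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        keys[i] = (uint_fast16_t)(rand() % 62);
}
```

The timed loop then becomes `binary_search(keys[i], a, 61)`, and any effect from alignment or prefetching is no longer drowned out by `rand()`.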

Lundin
  • Faster code, because the memory is then aligned to one cache line. – Franciszek Balcerak Nov 26 '20 at 15:34
  • @FranciszekBalcerak I believe the cache line size is 64 bytes on most modern processors you're likely to have in your PC. You have 61 entries. Since your entries happen to be 8 bytes, that's 488 bytes, or 8 cache lines. It's theoretically possible that the object could be aligned just wrong to span 9 cache lines, with no additional alignment forced. – Thomas Jager Nov 26 '20 at 16:21