
I am new to multi-threaded programming, and I knew coming into it that there are some weird side effects if you are not careful, but I didn't expect to be THIS puzzled by code I wrote. I am writing what I would think is an obvious first test of threads: just summing the numbers from 0 to x inclusive (of course https://www.reddit.com/r/mathmemes/comments/gq36wb/nn12/ but what I am trying to do is more an exercise in how to use threads than in making that program as fast as possible). I use a function call to create threads based on a hard-coded number of cores on the system, and a "boolean" that says whether the processor has multi-threading capabilities. I split the work more or less evenly, so each thread sums over a range; in theory, if all the threads manage to work together, I could get roughly numcores times the normal computation speed, which is indeed exciting. To my surprise, it worked more or less how I expected; until I did some tweaking.

Before continuing, I think a little bit of code will help:

These are the pre-processor defines I use in my base code:

#define NUM_CORES 4
#define MULTI_THREADED 1 //1 for true, 0 for false
#define BIGVALUE 1000000000UL

I use this struct to pass in args to my thread oriented function:

typedef struct sum_args
{
    int64_t start;
    int64_t end;
    int64_t return_total;
} sum_args;

This is the function that makes the threads:

int64_t SumUpTo_WithThreads(int64_t limit)
{   //start counting from zero
    const int numthreads = NUM_CORES + (int)(NUM_CORES*MULTI_THREADED*0.25);
    pthread_t threads[numthreads];
    sum_args listofargs[numthreads];
    int64_t offset = limit/numthreads; //loss of precision after decimal be careful
    int64_t total = 0;

    //i < numthread-1 since offset is not assured to be exactly limit/numthreads due to integer division
    for (int i = 0; i < numthreads-1; i++)
    {
        listofargs[i] = (sum_args){.start = offset*i, .end = offset*(i+1)};
        pthread_create(&threads[i], NULL, SumBetween, (void *)(&listofargs[i]));
    }
    //edge case catch
    //limit + 1, since SumBetween() is not inclusive of .end aka stops at .end -1 for each loop
    listofargs[numthreads-1] = (sum_args){.start = offset*(numthreads-1), .end = limit+1};
    pthread_create(&threads[numthreads-1], NULL, SumBetween, (void *)(&listofargs[numthreads-1]));

    //finishing
    for (int i = 0; i < numthreads; i++)
    {
        pthread_join(threads[i], NULL); //used to ensure thread is done before adding .return_total
        total += listofargs[i].return_total;
    }

    return total;
}

Here is just a "normal" implementation of summing, for comparison's sake:

int64_t SumUpTo(int64_t limit)
{
    uint64_t total = 0;
    for (uint64_t i = 0; i <= limit; i++)
        total += i;
    return total;
}

This is the function the threads run, and it has "two implementations": one that is, for some reason, fast, and one that is, for some reason, SLOW (this is what I am confused about). Extra side note: I use the pre-processor directives just to make the SLOWER and FASTER versions easier to compile.

void* SumBetween(void *arg)
{
    #ifdef SLOWER
    ((sum_args *)arg)->return_total = 0;
    for (int64_t i = ((sum_args *)arg)->start; i < ((sum_args *)arg)->end; i++)
        ((sum_args *)arg)->return_total += i;
    #endif

    #ifdef FASTER
    uint64_t total = 0;
    for (int64_t i = ((sum_args *)arg)->start; i < ((sum_args *)arg)->end; i++)
        total += i;
    ((sum_args *)arg)->return_total = total;
    #endif
    
    return NULL;
}

And here is my main:

int main(void)
{
    #ifdef THREADS
    printf("%ld\n", SumUpTo_WithThreads(BIGVALUE));
    #endif

    #ifdef NORMAL
    printf("%ld\n", SumUpTo(BIGVALUE));
    #endif 
    return 0;
}

Here is my compilation (I made sure to set the optimization level to 0, in order to avoid the compiler completely optimizing out the stupid summation program; after all, I want to learn about how to use threads!!!):

make faster
clang countV2.c -ansi -std=c99 -Wall -O0 -pthread -DTHREADS -DFASTER -o faster.exe

make slower
clang countV2.c -ansi -std=c99 -Wall -O0 -pthread -DTHREADS -DSLOWER -o slower.exe

clang --version
clang version 10.0.0-4ubuntu1 
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin

And here are the results/differences (note that the code generated by GCC showed the same side effect):

slower:
sudo time ./slower.exe 
500000000500000000
14.63user 0.00system 0:03.22elapsed 453%CPU (0avgtext+0avgdata 1828maxresident)k
0inputs+0outputs (0major+97minor)pagefaults 0swaps

faster:
sudo time ./faster.exe 
500000000500000000
2.97user 0.00system 0:00.67elapsed 440%CPU (0avgtext+0avgdata 1708maxresident)k
0inputs+0outputs (0major+83minor)pagefaults 0swaps

Why is using an extra stack-defined variable so much faster than just de-referencing the passed-in struct pointer!

I tried to find an answer to this question myself. I ended up doing some testing that implemented the same basic/naive summing algorithm from my SumUpTo() function, with the only difference being the data indirection each version deals with.

Here are the results:

Choose a function to execute!

int64_t sum(void) took: 2.207833 (s) //new stack defined variable, basically a copy of SumUpTo() func
void sumpoint(int64_t *total) took: 2.467067 (s)
void sumvoidpoint(void *total) took: 2.471592 (s)
int64_t sumstruct(void) took: 2.742239 (s)
void sumstructpoint(numbers *p) took: 2.488190 (s)
void sumstructvoidpoint(void *p) took: 2.486247 (s)
int64_t sumregister(void) took: 2.161722 (s)
int64_t sumregisterV2(void) took: 2.157944 (s)

The test produced the values I more or less expected. I therefore deduce that something beyond this level of indirection must be going on.

Just to add more information into the mix: I am running Linux, specifically the Mint distribution.

My processor info is as follows:

Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   36 bits physical, 48 bits virtual
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              2
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           42
Model name:                      Intel(R) Core(TM) i7-2760QM CPU @ 2.40GHz
Stepping:                        7
CPU MHz:                         813.451
CPU max MHz:                     3500.0000
CPU min MHz:                     800.0000
BogoMIPS:                        4784.41
Virtualization:                  VT-x
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        1 MiB
L3 cache:                        6 MiB
NUMA node0 CPU(s):               0-7
Vulnerability Itlb multihit:     KVM: Mitigation: Split huge pages
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cach
                                 e flushes, SMT vulnerable
Vulnerability Mds:               Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled v
                                 ia prctl and seccomp
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user
                                  pointer sanitization
Vulnerability Spectre v2:        Mitigation; Full generic retpoline, IBPB condit
                                 ional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds:             Not affected
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtr
                                 r pge mca cmov pat pse36 clflush dts acpi mmx f
                                 xsr sse sse2 ht tm pbe syscall nx rdtscp lm con
                                 stant_tsc arch_perfmon pebs bts nopl xtopology 
                                 nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes
                                 64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xt
                                 pr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_de
                                 adline_timer aes xsave avx lahf_lm epb pti ssbd
                                  ibrs ibpb stibp tpr_shadow vnmi flexpriority e
                                 pt vpid xsaveopt dtherm ida arat pln pts md_cle
                                 ar flush_l1d

If you wish to compile the code yourself, or see the generated assembly for my specific instance, please take a look at: https://github.com/spaceface102/Weird_Threads. The main source code is "countV2.c", just in case you get lost. Thank you for the help!

/*EOPost*/
  • Generate disassemblies of the two different function versions (compiled SLOWER vs FASTER). – WhozCraig Jun 12 '21 at 10:38
  • More than one parameter struct fits in the same cache line, and without optimization (`-O0`) the compiler doesn't optimize the sum into a register. So you have 2 or 3 threads trying to store/reload to different parts of the same cache line at the same time, causing contention. (This is called false sharing.) This perf diff would go away if benchmarked with optimization enabled. ([Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?](https://stackoverflow.com/q/53366394)) – Peter Cordes Jun 12 '21 at 10:40
  • Or put `alignas(64)` (stdalign.h) on the first struct member, as shown in [Cache lines, false sharing and alignment](https://stackoverflow.com/q/18236603), so each struct goes in a separate cache line. That would avoid false sharing, even if you compile inefficiently so that it still stores every iteration. (And BTW, `size_t` would be a more normal choice for the start/end indices, not `int64_t`). – Peter Cordes Jun 12 '21 at 10:45
  • Also note that the CPU pipeline is able to hide a significant amount of the false sharing problem; otherwise the slowdown factor would be an order of magnitude higher. It can do store-forwarding even while it doesn't own the line, and commit a bunch of stores from the store buffer when it does gain ownership. ([Why does false sharing still affect non atomics, but much less than atomics?](https://stackoverflow.com/q/61672049)) – Peter Cordes Jun 12 '21 at 10:48
  • Anyway, as [False sharing over multiple cores](https://stackoverflow.com/q/52612234) says, "**Typical bad patterns are writes to `vector[my_thread_index]`.**" – Peter Cordes Jun 12 '21 at 10:50
  • @PeterCordes implied this, but I'll say it directly: benchmarking with optimization disabled is an utter waste of time. If you don't have optimization enabled, you don't really care how fast things run. – Andrew Henle Jun 12 '21 at 10:56
  • @AndrewHenle: That's true in general. In this case, it's a way to reveal an interesting cpu-architecture effect that you wouldn't see with such simple code otherwise. (Unless you used `volatile` or `_Atomic`.) Oh, I hadn't realized this was just summing `i`, not `arr[i]`. I was going to say "if the compiler's alias analysis failed to realize that arg->result couldn't overlap with the array", but there's no other memory being read around these writes, except local vars. And clang -O2 would have turned this loop into the closed-form formula, making it 100% pointless as a pthreads test case. – Peter Cordes Jun 12 '21 at 11:01
  • @AndrewHenle: I guess the key point is that timing with `-O0` tells you nothing about "how fast your code is" in any kind of absolute or even relative sense (`-O0` code has *different* bottlenecks from optimized). But that doesn't rule out asking about some perf effect you see with source changes at -O0, as long as you realize it's going to be a CPU-architecture question (which this was already tagged). e.g. SnB variable-latency store-forwarding - [Adding a redundant assign speeds up ... without opt](//stackoverflow.com/q/49189685) which was salvaged by turning into an asm question – Peter Cordes Jun 12 '21 at 11:09
  • -O0 does make sense while learning to use pthreads, and lets you use really dumb functions as your workload without even having to try to get the compiler not to optimize them away. Oh and BTW, re: clang optimizing this loop into the closed form with a few instructions: that was one of the demos in Matt Godbolt's CppCon2017 talk “[What Has My Compiler Done for Me Lately? Unbolting the Compiler's Lid](https://youtu.be/bSkpMdDe4g4)” – Peter Cordes Jun 12 '21 at 11:11

0 Answers