I am new to multithreaded programming, and I knew coming into it that there are some weird side effects if you are not careful, but I didn't expect to be THIS puzzled about code I wrote. I am writing what I would think is an obvious first exercise with threads: summing the numbers from 0 to x inclusive (yes, I know about n(n+1)/2, see https://www.reddit.com/r/mathmemes/comments/gq36wb/nn12/, but the point here is to practice using threads, not to make the program as fast as possible). I use a function to create threads based on a hard-coded number of cores on the system, plus a "boolean" that says whether the processor is multithreaded. I split the work more or less evenly, so each thread sums over its own range; in theory, if all the threads cooperate, I could get close to numcores times the single-threaded speed, which is indeed exciting. To my surprise, it worked more or less how I expected; until I did some tweaking.
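(For the record: with x = 1000000000, the closed form gives 1000000000 * 1000000001 / 2 = 500000000500000000, which matches the output both versions print further down.)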
Before continuing, I think a little bit of code will help:
These are the headers and preprocessor defines I use in my base code:
#include <stdio.h>   //printf
#include <stdint.h>  //int64_t
#include <pthread.h> //pthread_create, pthread_join

#define NUM_CORES 4
#define MULTI_THREADED 1 //1 for true, 0 for false
#define BIGVALUE 1000000000UL
I use this struct to pass args to the function each thread runs:
typedef struct sum_args
{
    int64_t start;
    int64_t end;
    int64_t return_total;
} sum_args;
This is the function that makes the threads:
int64_t SumUpTo_WithThreads(int64_t limit)
{   //start counting from zero
    //NUM_CORES plus 25% extra threads when the CPU is multithreaded; 5 threads with the defines above
    const int numthreads = NUM_CORES + (int)(NUM_CORES*MULTI_THREADED*0.25);
    pthread_t threads[numthreads];
    sum_args listofargs[numthreads];
    int64_t offset = limit/numthreads; //integer division truncates; the last thread picks up the remainder
    int64_t total = 0;
    //i < numthreads-1 since offset is not assured to be exactly limit/numthreads due to integer division
    for (int i = 0; i < numthreads-1; i++)
    {
        listofargs[i] = (sum_args){.start = offset*i, .end = offset*(i+1)};
        pthread_create(&threads[i], NULL, SumBetween, (void *)(&listofargs[i]));
    }
    //edge case catch
    //limit + 1, since SumBetween() is not inclusive of .end, aka it stops at .end - 1
    listofargs[numthreads-1] = (sum_args){.start = offset*(numthreads-1), .end = limit+1};
    pthread_create(&threads[numthreads-1], NULL, SumBetween, (void *)(&listofargs[numthreads-1]));
    //finishing
    for (int i = 0; i < numthreads; i++)
    {
        pthread_join(threads[i], NULL); //ensure the thread is done before adding its .return_total
        total += listofargs[i].return_total;
    }
    return total;
}
Here is a "normal" implementation of summing, just for comparison's sake:
int64_t SumUpTo(int64_t limit)
{
    uint64_t total = 0;
    for (uint64_t i = 0; i <= limit; i++)
        total += i;
    return total;
}
This is the function the threads run. It has "two implementations": one that is for some reason fast, and one that is for some reason SLOW (this is what I am confused about). Extra side note: I use the preprocessor directives just to make the SLOWER and FASTER versions easier to compile.
void* SumBetween(void *arg)
{
#ifdef SLOWER
    ((sum_args *)arg)->return_total = 0;
    for (int64_t i = ((sum_args *)arg)->start; i < ((sum_args *)arg)->end; i++)
        ((sum_args *)arg)->return_total += i;
#endif
#ifdef FASTER
    uint64_t total = 0;
    for (int64_t i = ((sum_args *)arg)->start; i < ((sum_args *)arg)->end; i++)
        total += i;
    ((sum_args *)arg)->return_total = total;
#endif
    return NULL;
}
And here is my main:
int main(void)
{
#ifdef THREADS
    printf("%ld\n", SumUpTo_WithThreads(BIGVALUE));
#endif
#ifdef NORMAL
    printf("%ld\n", SumUpTo(BIGVALUE));
#endif
    return 0;
}
Here is my compilation (I made sure to set the optimization level to 0, in order to keep the compiler from completely optimizing out the stupid summation program; after all, I want to learn how to use threads!!!):
make faster
clang countV2.c -ansi -std=c99 -Wall -O0 -pthread -DTHREADS -DFASTER -o faster.exe
make slower
clang countV2.c -ansi -std=c99 -Wall -O0 -pthread -DTHREADS -DSLOWER -o slower.exe
clang --version
clang version 10.0.0-4ubuntu1
Target: x86_64-pc-linux-gnu
Thread model: posix
InstalledDir: /usr/bin
And here are the results/differences (note that the code generated with GCC also had the same side effect):
slower:
sudo time ./slower.exe
500000000500000000
14.63user 0.00system 0:03.22elapsed 453%CPU (0avgtext+0avgdata 1828maxresident)k
0inputs+0outputs (0major+97minor)pagefaults 0swaps
faster:
sudo time ./faster.exe
500000000500000000
2.97user 0.00system 0:00.67elapsed 440%CPU (0avgtext+0avgdata 1708maxresident)k
0inputs+0outputs (0major+83minor)pagefaults 0swaps
Why is using an extra stack-defined variable so much faster than just dereferencing the passed-in struct pointer?!
I tried to find an answer to this question myself. I ended up doing some testing that implemented the same basic/naive summing algorithm from my SumUpTo() function, with the only difference being the kind of data indirection involved.
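To make this concrete, here is a minimal sketch of the kind of test functions I mean; it is not my exact test code, and the LIMIT constant and the layout of the numbers struct are placeholders I am filling in here, but the only thing that changes between variants is where the accumulator lives:

#include <stdint.h>

#define LIMIT 1000000000L //placeholder bound

typedef struct numbers //assumed layout; just enough to hold the accumulator
{
    int64_t total;
} numbers;

//accumulator in a local variable, basically a copy of SumUpTo()
int64_t sum(void)
{
    int64_t total = 0;
    for (int64_t i = 0; i <= LIMIT; i++)
        total += i;
    return total;
}

//accumulator behind a plain pointer, written through the pointer every iteration
void sumpoint(int64_t *total)
{
    *total = 0;
    for (int64_t i = 0; i <= LIMIT; i++)
        *total += i;
}

//accumulator inside a struct, written through a struct pointer every iteration
void sumstructpoint(numbers *p)
{
    p->total = 0;
    for (int64_t i = 0; i <= LIMIT; i++)
        p->total += i;
}

The remaining variants (the void pointer, local struct, and register-qualified versions) follow the same pattern.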
Here are the results:
Choose a function to execute!
int64_t sum(void) took: 2.207833 (s) //new stack defined variable, basically a copy of SumUpTo() func
void sumpoint(int64_t *total) took: 2.467067 (s)
void sumvoidpoint(void *total) took: 2.471592 (s)
int64_t sumstruct(void) took: 2.742239 (s)
void sumstructpoint(numbers *p) took: 2.488190 (s)
void sumstructvoidpoint(void *p) took: 2.486247 (s)
int64_t sumregister(void) took: 2.161722 (s)
int64_t sumregisterV2(void) took: 2.157944 (s)
The test produced values more or less in line with what I expected: the extra indirection costs something, but nowhere near the roughly 5x gap between my slower.exe and faster.exe runs. I therefore deduce that it has to be something on top of this idea.
Just to add more information into the mix: I am running Linux, specifically the Mint distribution.
My processor info is as follows:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 36 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 42
Model name: Intel(R) Core(TM) i7-2760QM CPU @ 2.40GHz
Stepping: 7
CPU MHz: 813.451
CPU max MHz: 3500.0000
CPU min MHz: 800.0000
BogoMIPS: 4784.41
Virtualization: VT-x
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 1 MiB
L3 cache: 6 MiB
NUMA node0 CPU(s): 0-7
Vulnerability Itlb multihit: KVM: Mitigation: Split huge pages
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Full generic retpoline, IBPB conditional, IBRS_FW, STIBP conditional, RSB filling
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm epb pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid xsaveopt dtherm ida arat pln pts md_clear flush_l1d
If you wish to compile the code yourself, or to see the generated assembly for my specific setup, please take a look at: https://github.com/spaceface102/Weird_Threads. The main source file is "countV2.c", just in case you get lost. Thank you for the help!
/*EOPost*/