This should really be a comment in reply to Mats Petersson's answer, but I wanted to supply code examples.
The problem is contention for a single shared resource, and more specifically for the cache line it lives on.
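Each variant below takes the thread count as its first argument; something along the lines of g++ -O2 -pthread alt.cpp && ./a.out 4 should build and run it on a typical Linux setup (the file name is just a placeholder).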
Alternative 1:
#include <cstdint>
#include <cstdlib>
#include <thread>
#include <vector>

static const uint64_t ITERATIONS = 10000000000ULL;

int main(int argc, const char** argv)
{
    size_t numThreads = 1;
    if (argc > 1) {
        numThreads = std::strtoul(argv[1], nullptr, 10);
        if (numThreads == 0)
            return -1;
    }

    std::vector<std::thread> threads;

    uint64_t k = 0;
    for (size_t t = 0; t < numThreads; ++t) {
        threads.emplace_back([&k]() { // capture k by reference so we all use the same k.
            // Deliberate data race when numThreads > 1: this is the worst
            // case we are demonstrating, not a pattern to copy.
            while (k < ITERATIONS) {
                k++;
            }
        });
    }

    for (size_t t = 0; t < numThreads; ++t) {
        threads[t].join();
    }

    return 0;
}
Here the threads contend for a single variable, and every increment is a read-modify-write, so the cache line holding k ping-pongs between cores. That contention makes the single-threaded case the most efficient. (With more than one thread the unsynchronized access is also a data race, i.e. undefined behavior.)
Alternative 2:
#include <atomic>
#include <cstdint>
#include <cstdlib>
#include <thread>
#include <vector>

static const uint64_t ITERATIONS = 10000000000ULL;

int main(int argc, const char** argv)
{
    size_t numThreads = 1;
    if (argc > 1) {
        numThreads = std::strtoul(argv[1], nullptr, 10);
        if (numThreads == 0)
            return -1;
    }

    std::vector<std::thread> threads;

    std::atomic<uint64_t> k{0}; // brace-init: std::atomic is not copyable.
    for (size_t t = 0; t < numThreads; ++t) {
        threads.emplace_back([&]() {
            // Imperfect division of labor, we'll fall short in some cases.
            for (uint64_t i = 0; i < ITERATIONS / numThreads; ++i) {
                k++; // atomic read-modify-write: race-free, but serialized.
            }
        });
    }

    for (size_t t = 0; t < numThreads; ++t) {
        threads[t].join();
    }

    return 0;
}
Here we divide the labor deterministically (we fall afoul of cases where numThreads is not a divisor of ITERATIONS, but it's close enough for this demonstration). Unfortunately, we still have contention: every atomic increment is serialized on the same shared element in memory.
Alternative 3:
#include <cstdint>
#include <cstdlib>
#include <thread>
#include <vector>

static const uint64_t ITERATIONS = 10000000000ULL;

int main(int argc, const char** argv)
{
    size_t numThreads = 1;
    if (argc > 1) {
        numThreads = std::strtoul(argv[1], nullptr, 10);
        if (numThreads == 0)
            return -1;
    }

    std::vector<std::thread> threads;

    // One counter per thread; the vector must be sized before the threads
    // start so ks[t] is valid and no reallocation happens underneath them.
    std::vector<uint64_t> ks(numThreads, 0);
    for (size_t t = 0; t < numThreads; ++t) {
        threads.emplace_back([=, &ks]() {
            auto& k = ks[t];
            // Imperfect division of labor, we'll fall short in some cases.
            for (uint64_t i = 0; i < ITERATIONS / numThreads; ++i) {
                k++;
            }
        });
    }

    uint64_t k = 0;
    for (size_t t = 0; t < numThreads; ++t) {
        threads[t].join();
        k += ks[t];
    }

    return 0;
}
Again this is deterministic about the distribution of the workload, and we spend a small amount of effort at the end to collate the results. However, we did nothing to keep the per-thread counters apart in memory: adjacent elements of ks sit on the same cache lines, so the cores still invalidate each other's caches (false sharing). For that:
Alternative 4:
#include <cstdint>
#include <cstdlib>
#include <mutex>
#include <thread>
#include <vector>

static const uint64_t ITERATIONS = 10000000000ULL;

#define CACHE_LINE_SIZE 128

int main(int argc, const char** argv)
{
    size_t numThreads = 1;
    if (argc > 1) {
        numThreads = std::strtoul(argv[1], nullptr, 10);
        if (numThreads == 0)
            return -1;
    }

    std::vector<std::thread> threads;

    std::mutex kMutex;
    uint64_t k = 0;
    for (size_t t = 0; t < numThreads; ++t) {
        // kMutex must be captured by reference: std::mutex is not copyable.
        threads.emplace_back([=, &k, &kMutex]() {
            alignas(CACHE_LINE_SIZE) uint64_t myK = 0;
            // Imperfect division of labor, we'll fall short in some cases.
            for (uint64_t i = 0; i < ITERATIONS / numThreads; ++i) {
                myK++;
            }
            std::lock_guard<std::mutex> lock(kMutex); // taken once per thread.
            k += myK;
        });
    }

    for (size_t t = 0; t < numThreads; ++t) {
        threads[t].join();
    }

    return 0;
}
Here we avoid contention between threads down to the cache line level, except at the single point at the end where each thread takes a mutex to fold its local count into the shared total. Relative to the one addition it protects the mutex is one hell of a cost, but it is only paid once per thread rather than once per increment. Alternatively, you could use alignas to give each thread its own padded storage at the outer scope and sum the results after the joins, eliminating the need for the mutex. I leave that as an exercise to the reader.
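For reference, a minimal sketch of that mutex-free variant could look like the following. The PaddedCounter struct is my own illustration (not from the answer above), and it relies on C++17 so that the vector's allocator honors the over-alignment:

#include <cstdint>
#include <cstdlib>
#include <thread>
#include <vector>

static const uint64_t ITERATIONS = 10000000000ULL;

#define CACHE_LINE_SIZE 128

// Pad each counter to a full cache line so that no two threads' counters
// ever share a line; sizeof(PaddedCounter) becomes CACHE_LINE_SIZE.
struct alignas(CACHE_LINE_SIZE) PaddedCounter {
    uint64_t value = 0;
};

int main(int argc, const char** argv)
{
    size_t numThreads = 1;
    if (argc > 1) {
        numThreads = std::strtoul(argv[1], nullptr, 10);
        if (numThreads == 0)
            return -1;
    }

    std::vector<PaddedCounter> counters(numThreads); // C++17: new honors alignas.
    std::vector<std::thread> threads;

    for (size_t t = 0; t < numThreads; ++t) {
        threads.emplace_back([=, &counters]() {
            for (uint64_t i = 0; i < ITERATIONS / numThreads; ++i) {
                counters[t].value++;
            }
        });
    }

    uint64_t k = 0;
    for (size_t t = 0; t < numThreads; ++t) {
        threads[t].join();
        k += counters[t].value; // summed after the joins; no mutex needed.
    }

    return 0;
}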