7

I was playing around with std::thread and something weird popped up:

#include <thread>

int k = 0;

int main() {
    std::thread t1([]() { while (k < 1000000000) { k = k + 1; }});
    std::thread t2([]() { while (k < 1000000000) { k = k + 1; }});

    t1.join();
    t2.join();

    return 0;
}

When compiling the above code with no optimizations using clang++, I got the following benchmarks:

real 0m2.377s  
user 0m4.688s  
sys  0m0.005s

I then changed my code to the following, now using only one thread:

#include <thread>

int k = 0;

int main() {
    std::thread t1([]() { while (k < 1000000000) { k = k + 1; }});

    t1.join();

    return 0;
}

And these were the new benchmarks:

real 0m2.304s
user 0m2.298s
sys  0m0.003s

Why is the code utilizing 2 threads slower than the code utilizing 1?

Andy Prowl
  • 124,023
  • 23
  • 387
  • 451
user123
  • 8,970
  • 2
  • 31
  • 52
  • 3
    All the threads are doing is bashing heads trying to read from and write to k. And when the first finishes, it still has to wait for the second. – chris Jun 16 '13 at 20:00
  • @chris: `k` is not `volatile`, so the threads aren't competing, because they effectively aren't sharing `k`. – Mooing Duck Jun 17 '13 at 00:01
  • 1
    @MooingDuck: `volatile` ensures writes to memory, but not having `volatile` doesn't prevent them. The question specifically says "...with no optimisations..." and it's typical for non optimised builds to follow the program instructions very literally and not to "cache" values in registers. On recent Intel/AMD hardware, there's automatic flushing between core caches accessing the same memory address which would slow this down. – Tony Delroy Jun 17 '13 at 01:19
  • @TonyD: That's a good point about the debug builds... – Mooing Duck Jun 17 '13 at 03:39
  • 1
    the compiler would likely not cache a global variable even if you left out volatile. What's happening is that threads write stale values, which can move the count back by anywhere from 1 to 100,000s of steps, especially if a thread gets preempted after reading k but before writing k + 1. – johnnycrash May 15 '14 at 05:53

4 Answers

16

You have two threads fighting over the same variable, k. So you spend time with the processors saying "Processor 1: Hey, do you know what value k has? Processor 2: Sure, here you go!", ping-ponging the cache line back and forth every few updates. Since k isn't atomic, there's also no guarantee that thread 2 doesn't write an "old" value of k, so that the next time thread 1 reads the value, it jumps back 1, 2, 10 or 100 steps and has to do them over again. In theory that could lead to neither of the loops ever finishing, but that would require quite a bit of bad luck.
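
For comparison, a minimal sketch of a race-free variant of the question's code (assuming C++11's std::atomic): ++k is a single indivisible read-modify-write, so no update can be lost, but the cache line still ping-pongs between the cores, so it is still no faster than one thread.

#include <atomic>
#include <thread>

std::atomic<int> k(0);

int main() {
    // ++k on an atomic is one indivisible read-modify-write, so writes can
    // no longer be lost - but both cores still fight over k's cache line.
    std::thread t1([]() { while (k < 1000000000) { ++k; } });
    std::thread t2([]() { while (k < 1000000000) { ++k; } });

    t1.join();
    t2.join();
    return 0;
}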

Mats Petersson
  • 126,704
  • 14
  • 140
  • 227
  • 7
    @Magtheridon96: In addition to what Mats said, your program has a data race and thus formally undefined behaviour. _Anything_ could happen because you access a shared non-atomic variable from two threads without synchronization. The memory model of x86 processors is somewhat lenient but you really really should not rely on that. – JohannesD Jun 16 '13 at 20:42
  • @JohannesD At first, I was testing it with an `std::atomic`, and I got similar results, so I tried with an `int`. Both testcases were faster, but the ratios between the times were about the same. Thanks guys. You've shown me the importance of having atomic storage units. – user123 Jun 16 '13 at 20:45
  • 7
    @Magtheridon96: Simply making `k` an `std::atomic` would prevent the data race, but it would not make the ping pong issue go away because the expression `k = k + 1` would still not be atomic. However, `k++` or `k += 1` would. But the synchronization between processors enforced by the atomic operations, even if lock-free, would still cause overhead - this is simply not a use case that can be sped up using threads. Think of it as trying to write a story with your friend - it would not be faster than doing it alone because you'd have to take turns. – JohannesD Jun 16 '13 at 20:48
  • 2
    @Magtheridon96: `std::atomic` won't help the fact that the data is shared between the two threads - in fact, it makes it a tiny bit worse in the sense that every write on an atomic FORCES the processor that isn't inside the "atomic" part to "give up" its copy of that value and wait for the "atomic" to finish before it can fetch the new value. This is why sharing much data between threads is usually SLOWER than running in a single thread. – Mats Petersson Jun 16 '13 at 20:49
  • @JohannesD Also explains why I got a nice speed boost when I changed `k = k + 1` to `k++`. (From about 1 minute 11 seconds to 35 seconds after changing the code to use an `std::atomic`) – user123 Jun 16 '13 at 20:51
  • @JohannesD +1 for the story analogy. – rwols Jun 16 '13 at 21:08
4

This should really be a comment in reply to Mats Petersson's answer, but I wanted to supply code examples.

The problem is contention for a specific resource, and also for a cache line.

Alternative 1:

#include <cstdint>
#include <thread>
#include <vector>
#include <stdlib.h>

static const uint64_t ITERATIONS = 10000000000ULL;

int main(int argc, const char** argv)
{
    size_t numThreads = 1;
    if (argc > 1) {
        numThreads = strtoul(argv[1], NULL, 10);
        if (numThreads == 0)
            return -1;
    }

    std::vector<std::thread> threads;

    uint64_t k = 0;
    for (size_t t = 0; t < numThreads; ++t) {
       threads.emplace_back([&k]() { // capture k by reference so we all use the same k.
           while (k < ITERATIONS) {
               k++;
           }
       });
    }

    for (size_t t = 0; t < numThreads; ++t) {
        threads[t].join();
    }
    return 0;
}

Here the threads contend for a single variable, performing both reads and writes, which forces the cache line to ping-pong between the cores, causing contention and making the single-threaded case the most efficient.

Alternative 2:

#include <cstdint>
#include <thread>
#include <vector>
#include <stdlib.h>
#include <atomic>

static const uint64_t ITERATIONS = 10000000000ULL;

int main(int argc, const char** argv)
{
    size_t numThreads = 1;
    if (argc > 1) {
        numThreads = strtoul(argv[1], NULL, 10);
        if (numThreads == 0)
            return -1;
    }

    std::vector<std::thread> threads;

    std::atomic<uint64_t> k(0); // direct-initialization; a std::atomic cannot be copied
    for (size_t t = 0; t < numThreads; ++t) {
       threads.emplace_back([&]() {
           // Imperfect division of labor, we'll fall short in some cases.
           for (size_t i = 0; i < ITERATIONS / numThreads; ++i) {
               k++;
           }
       });
    }

    for (size_t t = 0; t < numThreads; ++t) {
        threads[t].join();
    }
    return 0;
}

Here we divide the labor deterministically (we fall afoul of cases where numThreads is not a divisor of ITERATIONS, but it's close enough for this demonstration). Unfortunately, we still encounter contention for access to the shared element in memory.

Alternative 3:

#include <cstdint>
#include <thread>
#include <vector>
#include <stdlib.h>
#include <atomic>

static const uint64_t ITERATIONS = 10000000000ULL;

int main(int argc, const char** argv)
{
    size_t numThreads = 1;
    if (argc > 1) {
        numThreads = strtoul(argv[1], NULL, 10);
        if (numThreads == 0)
            return -1;
    }

    std::vector<std::thread> threads;
    std::vector<uint64_t> ks(numThreads, 0); // one counter per thread; must be sized before the threads index into it

    for (size_t t = 0; t < numThreads; ++t) {
       threads.emplace_back([=, &ks]() {
           auto& k = ks[t];
           // Imperfect division of labor, we'll fall short in some cases.
           for (size_t i = 0; i < ITERATIONS / numThreads; ++i) {
               k++;
           }
       });
    }

    uint64_t k = 0;
    for (size_t t = 0; t < numThreads; ++t) {
        threads[t].join();
        k += ks[t];
    }
    return 0;
}

Again this is deterministic about the distribution of the workload, and we spend a small amount of effort at the end to collate the results. However, we did nothing to ensure the counters are spread out in a CPU-friendly way: adjacent elements of ks share a cache line, so the threads still suffer false sharing. To fix that:

Alternative 4:

#include <cstdint>
#include <thread>
#include <vector>
#include <stdlib.h>
#include <mutex>

static const uint64_t ITERATIONS = 10000000000ULL;
#define CACHE_LINE_SIZE 128

int main(int argc, const char** argv)
{
    size_t numThreads = 1;
    if (argc > 1) {
        numThreads = strtoul(argv[1], NULL, 10);
        if (numThreads == 0)
            return -1;
    }

    std::vector<std::thread> threads;
    std::mutex kMutex;
    uint64_t k = 0;

    for (size_t t = 0; t < numThreads; ++t) {
       threads.emplace_back([&]() { // capture by reference; a std::mutex cannot be copied
           alignas(CACHE_LINE_SIZE) uint64_t myK = 0;
           // Imperfect division of labor, we'll fall short in some cases.
           for (uint64_t i = 0; i < ITERATIONS / numThreads; ++i) {
               myK++;
           }
           kMutex.lock();
           k += myK;
           kMutex.unlock();
       });
    }

    for (size_t t = 0; t < numThreads; ++t) {
        threads[t].join();
    }
    return 0;
}

Here we avoid contention between threads down to the cache-line level, except for the single point at the end where we use a mutex to synchronize. For this trivial workload, the mutex has one hell of a relative cost. Alternatively, you could use alignas to provide each thread with its own storage at the outer scope and sum the results after the joins, eliminating the need for the mutex, as sketched below.
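
A minimal sketch of that mutex-free variant (assuming a 128-byte cache line, and C++17 so that the vector's storage honors the over-alignment):

#include <cstdint>
#include <thread>
#include <vector>
#include <stdlib.h>

static const uint64_t ITERATIONS = 10000000000ULL;
#define CACHE_LINE_SIZE 128

// One counter per thread, aligned (and therefore padded) to a full cache
// line, so no two threads' counters ever share a line.
struct alignas(CACHE_LINE_SIZE) PaddedCounter {
    uint64_t value = 0;
};

int main(int argc, const char** argv)
{
    size_t numThreads = 1;
    if (argc > 1) {
        numThreads = strtoul(argv[1], NULL, 10);
        if (numThreads == 0)
            return -1;
    }

    std::vector<PaddedCounter> counters(numThreads);
    std::vector<std::thread> threads;

    for (size_t t = 0; t < numThreads; ++t) {
       threads.emplace_back([&counters, t, numThreads]() {
           for (uint64_t i = 0; i < ITERATIONS / numThreads; ++i) {
               counters[t].value++;
           }
       });
    }

    uint64_t k = 0;
    for (size_t t = 0; t < numThreads; ++t) {
        threads[t].join();
        k += counters[t].value; // collate after the joins; no mutex needed
    }
    return 0;
}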

kfsone
  • 23,617
  • 2
  • 42
  • 74
  • Thank you for taking the time to do all this :). I like the 3rd variation. – user123 Jun 16 '13 at 22:19
  • 1
    Try the 3rd with an object that's alignas(128) (or whatever your cache line size is) -- the goal being to move the thread's work space to a location that doesn't have *any* overlap with what other CPUs are working on. – kfsone Jun 16 '13 at 22:32
  • Finally at a computer instead of a phone, fixed up the compiler errors. – kfsone Jun 17 '13 at 01:15
2

Seems to me like the more important question than "why didn't this work?" is "How do I get this to work?" For the task at hand, I think std::async (despite significant shortcomings) is really a better tool than using std::thread directly.

#include <future>
#include <iostream>
#include <thread>
#include <vector>

int k = 0;
unsigned hw = std::thread::hardware_concurrency();
unsigned tasks = hw ? hw : 1; // hardware_concurrency() may return 0
unsigned reps = 1000000000 / tasks;

int main() {
    std::vector<std::future<int>> f;

    for (unsigned i=0; i<tasks; i++)
        f.emplace_back(std::async(std::launch::async, 
                                  [](){int j; for (j=0; j<reps; j++); return j;})
                      );

    for (unsigned i=0; i<tasks; i++) {
        f[i].wait();
        k += f[i].get();
    }

    std::cout << k << "\n";
    return 0;
}
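
The design choice here is the same one that wins in the cache-line answer above: each task counts in its own private variable and only the finished per-task totals are combined, so the tasks never share data (or a cache line) while they run.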
Jerry Coffin
  • 476,176
  • 80
  • 629
  • 1,111
0

I ran into this problem too. My opinion is that for certain types of job, the cost of managing threads may be more than the benefit you get from running in threads. Here is my code example, doing some real work in a loop with a large number of iterations, so I got very consistent numbers with the time command.

   // Inside a member function of Myclass; trimLeft/trimRight write their
   // results through the std::ref-wrapped output arguments.
   pair<int,int> result{0,0};
#ifdef USETHREAD
      thread thread_l(&Myclass::trimLeft, this, std::ref(fsq), std::ref(oriencnt), std::ref(result.first));
      thread thread_r(&Myclass::trimRight, this, std::ref(fsq), std::ref(oriencnt), std::ref(result.second));
      thread_l.join();
      thread_r.join();
#else
      // non threaded version faster
      trimLeft(fsq, oriencnt, result.first);
      trimRight(fsq, oriencnt, result.second);
#endif

   return result;

The time results

       Thread       No_thread
==============================
real   4m28s        2m49s
user   0m55s        2m49s
sys    0m6.2s       0m0.012s

I am ignoring the decimals of the seconds for the larger numbers. My code is only updating one shared variable, oriencnt; I have not let it update fsq yet. It looks like in the threaded version the system is doing more work, which results in a longer wall-clock time (real time). My compiler flags are the default -g -O2; I am not sure whether that is the key problem. When compiled with -O3 the difference from -O2 is minimal. There is also some mutex-controlled IO operation, but my experiments show that it does not contribute to the difference. I am using gcc 5.4 with C++11. One possibility is that the library is not optimized.

Here are the results compiled with -O3:

       Thread       No_thread
==============================
real   4m24s        2m44s
user   0m54s        2m44s
sys    0m6.2s       0m0.016s
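
For reference, a minimal sketch (illustrative only, not from my program) of measuring the fixed spawn-and-join cost of a thread that does no work, which is part of the extra overhead the threaded version pays:

#include <chrono>
#include <iostream>
#include <thread>

int main() {
    using clock = std::chrono::steady_clock;
    const int reps = 1000;

    auto start = clock::now();
    for (int i = 0; i < reps; ++i) {
        std::thread t([]{}); // a thread that does nothing
        t.join();            // pay the full create/schedule/destroy cost
    }
    auto elapsed = clock::now() - start;

    std::cout << "spawn+join: "
              << std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count() / reps
              << " us per thread\n";
    return 0;
}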
Kemin Zhou
  • 6,264
  • 2
  • 48
  • 56