Worker threads much worse performance than main

Question

I have a small test program that compiles alongside my library for testing the speeds of various mathematics functions when using different methods (SSE, for-loop, unrolled-loop, etc.). These tests run each method hundreds of thousands of times and work out the mean computation time. I decided to create four worker threads, one for each core of my computer, and run the benchmarks that way.

Now these are micro-benchmarks, measured in nanoseconds, so the differences may seem large in relative terms, but there is no other kind of difference at that level really.

Here is my code for running over functions in a single-threaded fashion:

static constexpr std::size_t num_tests = 400000;
auto do_test = [=](uint64_t(*test)()){
    // test is a function that returns nanoseconds taken for a specific method
    uint64_t accum = 0;
    for(std::size_t n = 0; n < num_tests; n++)
        accum += test();
    return accum / num_tests;
};
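
For context, a `test` function of the kind `do_test` expects might look like the sketch below. The question does not show one, so the names `scalar_add` and `time_scalar_add` and the workload itself are hypothetical; the only thing taken from the question is that the timer starts and stops inside the function and the result is returned in nanoseconds.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical workload standing in for one of the maths methods.
inline float scalar_add(float a, float b) { return a + b; }

// Returns the nanoseconds taken by one call of the workload.
// Assumption: std::chrono::steady_clock is precise enough on the target platform.
std::uint64_t time_scalar_add() {
    using clock = std::chrono::steady_clock;
    // volatile inputs and output discourage the compiler from folding the call away
    volatile float a = 1.5f, b = 2.5f;
    const auto start = clock::now();
    volatile float result = scalar_add(a, b);
    const auto stop = clock::now();
    (void)result;
    return static_cast<std::uint64_t>(
        std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count());
}
```

Note that the two `clock::now()` calls cost time themselves, so a figure measured this way includes some timer overhead.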

and here is my (faster) code for running over tests in a multi-threaded fashion:

static constexpr std::size_t num_tests = 100000;
auto do_test = [=](uint64_t(*test)()){
    uint64_t accum = 0;

    std::thread first([&](){
        for(std::size_t n = 0; n < num_tests; n++)
            accum += test();
    });
    std::thread second([&](){
        for(std::size_t n = 0; n < num_tests; n++)
            accum += test();
    });
    std::thread third([&](){
        for(std::size_t n = 0; n < num_tests; n++)
            accum += test();
    });
    std::thread fourth([&](){
        for(std::size_t n = 0; n < num_tests; n++)
            accum += test();
    });

    first.join();
    second.join();
    third.join();
    fourth.join();

    return accum / (num_tests * 4);
};

BUT the results are slower D: The benchmark as a whole finishes faster, but each operation measures as slower.

My single threaded version gives a mean of 77 nanoseconds whereas my multithreaded version gives a mean of 150 nanoseconds for the operations!

Why would this be?

P.S. I know it's a minuscule difference, I just thought it was interesting.

You should measure how long it takes to start a thread too. It's not free. — nos, Apr 24 '14 at 07:22
Makes no difference, the timer starts in the `test` function before the method. — RamblingMad, Apr 24 '14 at 07:23
One problem is that you have four threads all modifying the same variable (`accum`) without protection. This may cause unexpected things to happen. Use four separate variables instead, and add them together after the threads are finished. — Some programmer dude, Apr 24 '14 at 07:35
join() calls are not free either. Try setting num_tests to 40000000/10000000. Also what @JoachimPileborg says. — Martin James, Apr 24 '14 at 08:14
In the multithreaded version, each iteration requires a full cache flush since the variable is shared. Those are far from free. Try with one local accumulator per thread, writing the result to the shared one after the loop. — molbdnilo, Apr 24 '14 at 08:24
This is a [*false sharing*](http://en.wikipedia.org/wiki/False_sharing) problem. Except that it isn't false :) — Hans Passant, Apr 24 '14 at 09:08
@HansPassant, is it false sharing or just a race condition? `accum` is shared by each thread so they are each competing to write to the same value. False sharing would be if they were each trying to write to different values in the same cache-line. — Z boson, Apr 24 '14 at 09:10
But they are all trying to write a different value to the same cache line. — Hans Passant, Apr 24 '14 at 09:16
@HansPassant Maybe I don't understand something from `std::thread` (I have never used it) but in OpenMP `accum` would be a shared variable at the same memory address for each thread. The OP is trying to do a parallel reduction. — Z boson, Apr 24 '14 at 09:20
@HansPassant, by same value I mean memory address (value was poor word choice). So each thread is trying to write to the same memory address. False sharing is when each thread is trying to write to different memory addresses in the same cache line. — Z boson, Apr 24 '14 at 10:48
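
The suggestion made in the comments above (one accumulator per thread, combined only after the threads have finished) can be sketched roughly as follows. This is a minimal sketch, not a confirmed fix for the timings in the question: `test_fn`, `padded_sum` and `num_threads` are names introduced here for illustration, and the 64-byte alignment assumes a 64-byte cache line, which may not hold on every CPU.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <thread>

// Same shape as the question's test functions: returns nanoseconds for one run.
using test_fn = std::uint64_t (*)();

static constexpr std::size_t num_tests   = 100000;
static constexpr std::size_t num_threads = 4;

// One slot per thread. alignas(64) is an assumption about the cache-line size;
// it keeps neighbouring slots on separate lines so they are not falsely shared.
struct alignas(64) padded_sum {
    std::uint64_t value = 0;
};

std::uint64_t do_test(test_fn test) {
    std::array<padded_sum, num_threads> sums{};
    std::array<std::thread, num_threads> workers;

    for (std::size_t t = 0; t < num_threads; ++t) {
        workers[t] = std::thread([&sums, t, test] {
            std::uint64_t local = 0;        // private to this thread
            for (std::size_t n = 0; n < num_tests; ++n)
                local += test();
            sums[t].value = local;          // a single write per thread
        });
    }

    std::uint64_t accum = 0;
    for (std::size_t t = 0; t < num_threads; ++t) {
        workers[t].join();                  // join before reading that thread's slot
        accum += sums[t].value;
    }
    return accum / (num_tests * num_threads);
}
```

This removes the unsynchronised writes to a single `accum`, so the mean is at least well defined. Whether it also brings the per-operation time back in line with the single-threaded figure would have to be measured.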
