
I'm experimenting with C++ standard threads. I wrote a small benchmark to measure performance overhead and overall throughput. The principle is to run, in one or several threads, a loop of 1 billion iterations, pausing briefly from time to time.

In a first version I used counters in shared memory (i.e. normal variables). I expected the following output:

Sequential      1e+009 loops    4703 ms 212630 loops/ms
2 thrds:t1      1e+009 loops    4734 ms 211238 loops/ms
2 thrds:t2      1e+009 loops    4734 ms 211238 loops/ms
2 thrds:tt      2e+009 loops    4734 ms 422476 loops/ms
manythrd tn     1e+009 loops    7094 ms 140964 loops/ms
...  
manythrd tt     6e+009 loops    7094 ms 845785 loops/ms

Unfortunately the output showed some counters as if they were uninitialised!

I could work around the issue by storing the final value of each counter in an atomic<> for later display. However, I do not understand why the version based on simple shared memory does not work properly: each thread uses its own counter, so there is no race condition. Even the display thread accesses the counters only after the counting threads have finished. Using volatile did not help either.

Could anyone explain this strange behaviour (as if memory was not updated) and tell me if I missed something?

Here are the shared variables:

const int maxthread = 6;
atomic<bool> other_finished(false);   // direct-initialised: atomics are not copyable
atomic<long> acounter[maxthread];

Here is the code of the threaded function:

void foo(long& count, int ic, long maxcount)   
{
    count = 0;  
    while (count < maxcount) {
        count++;
        if (count % 10000000 == 0)
            this_thread::sleep_for(chrono::microseconds(1));
    }
    other_finished = true;      // atomic: announce work is finished
    acounter[ic] = count;       // atomic: share result 
}

Here is an example of how I benchmark the threads:

mytimer.on();                 // second run, two threads
thread t1(foo, counter[0], 0, maxcount);  // additional thread
foo(counter[1], 1, maxcount);         // main thread
t1.join();                    // wait end of additional thread
perf = mytimer.off();     
display_perf("2 thrds:t1", counter[0], perf);  // non atomic version of code
display_perf("2 thrds:t2", counter[1], perf);
display_perf("2 thrds:tt", counter[0] + counter[1], perf);
Nick
Christophe
  • Yes! Sorry: MSVC 2013 on Win 8.1, with an Intel i7 – Christophe Jun 26 '14 at 20:05
  • Most likely not related to the problem. However, regarding performance, you should take a look at [False sharing](http://en.wikipedia.org/wiki/False_sharing), i.e. different threads shouldn't write to variables that are on the same cache line, in your case `counter`. – nosid Jun 26 '14 at 20:07
  • Very interesting article on false sharing. I suspected something with the cache. However, after your solution with std::ref(), I created a variant of my programme using a global array and without reference passing. This worked fine, which confirmed that the problem was not the cache but the reference. – Christophe Jun 26 '14 at 21:20
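
To illustrate the false-sharing remark in the comments: one common way to keep per-thread counters off a shared cache line is to pad each counter to a full line. The following is only a sketch under assumptions: the 64-byte line size is a guess that fits typical x86 CPUs, `alignas` needs a C++11 compiler that supports it, and the names `PaddedCounter`, `bump_to` and `count_padded` are hypothetical, not taken from the question.

```cpp
#include <cassert>
#include <functional>  // std::ref
#include <thread>

const int maxthread = 6;   // same limit as in the question

// Assumption: 64-byte cache lines. alignas pads every counter to a
// full line, so two threads never write to the same line.
struct alignas(64) PaddedCounter {
    long value = 0;
};

// Worker: increments its own padded counter up to maxcount.
void bump_to(PaddedCounter& counter, long maxcount)
{
    while (counter.value < maxcount)
        ++counter.value;
}

// Hypothetical helper: runs maxthread workers, one padded counter
// each, and returns the sum of all counters after joining.
long count_padded(long maxcount)
{
    assert(maxcount >= 0);
    PaddedCounter counters[maxthread];
    std::thread workers[maxthread];

    for (int i = 0; i < maxthread; ++i)
        workers[i] = std::thread(bump_to, std::ref(counters[i]), maxcount);

    long total = 0;
    for (int i = 0; i < maxthread; ++i) {
        workers[i].join();             // counters[i] is safe to read after join
        total += counters[i].value;
    }
    return total;
}
```

Whether the padding changes the measured throughput depends on the hardware and on how the compiler optimises the loop, so it would have to be benchmarked rather than assumed.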

1 Answer

Here is a simplified version to reproduce the problem:

void deep_thought(int& value) { value = 6 * 9; }

int main()
{
    int answer = 42;
    std::thread{deep_thought, answer}.join();
    return answer; // 42
}

It looks as if the code passes a reference to answer to the worker function, which then assigns 6 * 9 through that reference to answer. However, the constructor of std::thread makes a copy of answer and passes a reference to the copy to the worker function, so the variable answer in the main thread is never changed. The same happens in the question: thread t1(foo, counter[0], 0, maxcount) lets foo count in a copy, and counter[0] itself is never written.

Both GCC-4.9 and Clang-3.5 reject the above code, because std::thread passes its copy of the argument as an rvalue, which cannot bind to the worker function's non-const lvalue reference parameter. You can solve the problem by passing the variable with std::ref:

    std::thread{deep_thought, std::ref(answer)}.join();
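
Applied to the benchmark in the question, the same change could look roughly like the sketch below. It is a reduced version, not the original program: the sleep and the atomics are left out, and `count_to` and `run_two_threads` are hypothetical names introduced only for this illustration.

```cpp
#include <cassert>
#include <functional>  // std::ref
#include <thread>
#include <utility>     // std::pair

// Reduced version of the question's worker (no sleep, no atomics):
// counts up to maxcount through a reference to the caller's counter.
void count_to(long& count, long maxcount)
{
    count = 0;
    while (count < maxcount)
        ++count;
}

// Hypothetical helper: runs one counter in an extra thread and one in
// the calling thread, then returns both final values.
std::pair<long, long> run_two_threads(long maxcount)
{
    assert(maxcount >= 0);
    long counter[2] = {-1, -1};   // -1 marks "never written"

    // std::ref wraps counter[0] in a std::reference_wrapper, so the
    // thread writes to the caller's variable, not to a private copy.
    std::thread t1(count_to, std::ref(counter[0]), maxcount);
    count_to(counter[1], maxcount);   // calling thread does its share
    t1.join();   // after join, counter[0] is safe to read

    return {counter[0], counter[1]};
}
```

With std::ref in place, both returned values should equal maxcount, and no atomic<> is needed just to read the result after join().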
nosid