Console output order slows down multi-threaded program

Question

When compiling the following code

#include <iostream>
#include <vector>
#include <thread>
#include <chrono>
#include <mutex>

std::mutex cout_mut;

void task()
{
    for(int i=0; i<10; i++)
    {
        double d=0.0;
        for(size_t cnt=0; cnt<200000000; cnt++) d += 1.23456;

        std::lock_guard<std::mutex> lg(cout_mut);
        std::cout << d << "(Help)" << std::endl;
        //        std::cout << "(Help)" << d << std::endl;
    }
}

int main()
{
    std::vector<std::thread> all_t(std::thread::hardware_concurrency());

    auto t_begin = std::chrono::high_resolution_clock::now();

    for(auto& t : all_t) t = std::thread{task};
    for(auto& t : all_t) t.join();

    auto t_end = std::chrono::high_resolution_clock::now();

    std::cout << "Took : " << (t_end - t_begin).count() << std::endl;
}

Under MinGW 4.8.1 it takes roughly 2.5 seconds to execute on my box. That is approximately the time it takes to only execute the task function single-threadedly.

However, when I uncomment the line in the middle and therefore comment out the line before (that is, when I exchange the order in which d and "(Help)" are written to std::cout) the whole thing takes now 8-9 seconds.

What is the explanation?

I tested again and found out that I only have the problem with MinGW-build x32-4.8.1-win32-dwarf-rev3 but not with MinGW build x64-4.8.1-posix-seh-rev3. I have a 64-bit machine. With the 64-bit compiler both versions take three seconds. However, using the 32-bit compiler, the problem remains (and is not due to release/debug version confusion).

Could it somehow have happened that you got the fast time from a release build and the slow time from a debug build? I don't get any significant difference by the swap, with gcc 4.7.2/4.8.1, clang 3.2/3.3 (all Linux). But for release builds I get about 2 secs and for debug I get about 8. Notably similar to the ratio you are seeing (on Windows, on some other machine). — Mike Kinghan, Aug 04 '13 at 22:19
No, it is not a debug/release issue; please see my edit above. — levzettelin, Aug 05 '13 at 08:46
It seems a difference of x87 fpu for 32-bit and SSE2 for 64-bit. The 32-bit compiler may generate (slow) x87 instruction codes of `double` operations. — yohjp, Aug 12 '13 at 00:34
It looks like gcc 4.8.1 produces very different assembly for the inner loop. In one case it is just an addsd of the constant into xmm0. In the other case the intermediary result is accumulated on the stack, generating two extra movsd instructions in addition to the addsd. — Come Raczy, Aug 12 '13 at 14:43

score 1 · Accepted Answer · answered Aug 13 '13 at 07:36

It has nothing to do with multi-threading. It is a problem of loop optimization. I have rearranged the original code to get something minimalistic demonstrating the issue:

#include <iostream>
#include <chrono>
#include <mutex>

int main()
{
    auto t_begin = std::chrono::high_resolution_clock::now();
    for(int i=0; i<2; i++)
    {
        double d=0.0;
        for(int j=0; j<100000; j++) d += 1.23456;
        std::mutex mutex;
        std::lock_guard<std::mutex> lock(mutex);
#ifdef SLOW
        std::cout << 'a' << d << std::endl;
#else
        std::cout << d << 'a' << std::endl;
#endif
    }
    auto t_end = std::chrono::high_resolution_clock::now();
    std::cout << "Took : " << (static_cast<double>((t_end - t_begin).count())/1000.0) << std::endl;
}

When compiled and executed and with:

g++ -std=c++11 -DSLOW -o slow -O3 b.cpp -lpthread ; g++ -std=c++11 -o fast -O3 b.cpp -lpthread ; ./slow ; ./fast

The output is:

a123456
a123456
Took : 931
123456a
123456a
Took : 373

Most of the difference in timing is explained by the assembly code generated for the inner loop: the fast case accumulates directly in xmm0 while the slow case accumulates into xmm1 - leading to 2 extra movsd instructions.

Now, when compiled with the '-ftree-loop-linear' option:

g++ -std=c++11 -ftree-loop-linear -DSLOW -o slow -O3 b.cpp -lpthread ; g++ -std=c++11 -ftree-loop-linear -o fast -O3 b.cpp -lpthread ; ./slow ; ./fast

The output becomes:

a123456
a123456
Took : 340
123456a
123456a
Took : 346

Console output order slows down multi-threaded program

1 Answers1