Why will for-loop with multithreading not have as great performance as with single-thread?

Question

I believed that simple but heavy work (e.g. matrix calculation) is better processed with multiple threads than with a single thread, so I tested the following code:

#include <chrono>
#include <future>
#include <iostream>
#include <random>
#include <vector>

int main()
{
    constexpr int N = 100000;

    std::random_device rd;
    std::mt19937 mt(rd());
    std::uniform_real_distribution<double> ini(0.0, 10.0);

    // single-thread
    {
        std::vector<int> vec(N);
        for(int i = 0; i < N; ++i)
        {
            vec[i] = ini(mt);
        }

        auto start = std::chrono::system_clock::now();

        for(int i = 0; i < N; ++i)
        {
            vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
        }

        auto end = std::chrono::system_clock::now();
        auto dur = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
        std::cout << "single : " << dur << " ms."<< std::endl;
    }

    // multi-threading (Th is the number of threads)
    for(int Th : {1, 2, 4, 8, 16})
    {
        std::vector<int> vec(N);
        for(int i = 0; i < N; ++i)
        {
            vec[i] = ini(mt);
        }

        auto start = std::chrono::system_clock::now();

        std::vector<std::future<void>> fut(Th);
        for(int t = 0; t < Th; ++t)
        {
            fut[t] = std::async(std::launch::async, [t, &vec, &N, &Th]{
                for(int i = t*N / Th; i < (t + 1)*N / Th; ++i)
                {
                    vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
                }
            });
        }
        for(int t = 0; t < Th; ++t)
        {
            fut[t].get();
        }

        auto end = std::chrono::system_clock::now();
        auto dur = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
        std::cout << "Th = " << Th << " : " << dur << " ms." << std::endl;
    }

    return 0;
}

The execution environment :

OS : Windows 10 64-bit
Build-system : Visual Studio Community 2015
CPU : Core i5 4210U

When building this program in the Debug mode, the result was as I expected :

single : 146 ms.
Th = 1 : 140 ms.
Th = 2 : 71 ms.
Th = 4 : 64 ms.
Th = 8 : 61 ms.
Th = 16 : 68 ms.

This shows that the code not using std::async performs the same as the version using one thread, and that with 4 or 8 threads I get a good speedup.

However, in Release mode I got a different result (with N changed from 100000 to 100000000):

single : 54 ms.
Th = 1 : 443 ms.
Th = 2 : 285 ms.
Th = 4 : 205 ms.
Th = 8 : 206 ms.
Th = 16 : 221 ms.

I'm puzzled by this result. Among the std::async runs alone, more threads do perform better than one. But the fastest of all is the first loop, which does not use std::async. I know that optimization and the overhead around multithreading have a large effect on performance. However:

1. The process is just calculation of the vector, so what can be optimized not in the multi-thread codes but in the single-thread codes?
2. This program contains nothing about mutex or atomic etc, and data conflict might not occur. I think overheads around multithreading would be relatively small.
3. CPU utilization in the codes not using std::async is smaller than in the multi-threading codes. Is it efficient to use the large part of CPU?

Update: I looked into vectorization. I enabled the /Qvec-report:1 option and found the following:

//vectorized (when N is large)
for(int i = 0; i < N; ++i)
{
    vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
}

//not vectorized
auto lambda = [&vec, &N]{
    for(int i = 0; i < N; ++i)
    {
        vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
    }
};
lambda();

//not vectorized
std::vector<std::future<void>> fut(Th);
for(int t = 0; t < Th; ++t)
{
    fut[t] = std::async(std::launch::async, [t, &vec, &N, Th]{
        for(int i = t*N / Th; i < (t + 1)*N / Th; ++i)
        {
            vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
        }
    });
}

and the run times:

single (with vectorization) : 47 ms.
single (without vectorization)  : 70 ms.

So the for-loop was indeed not vectorized in the multi-threaded version. However, that alone does not explain the slowdown; the multi-threaded version must also lose time for other reasons.

Update 2 : I rewrote for-loop in the lambda (Type A to Type B) :

//Type A (the previous one)
fut[t] = std::async(std::launch::async, [t, &vec, &N, Th]{
    for(int i = t*N / Th; i < (t + 1)*N / Th; ++i)
    {
        vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
    }
});

//Type B (the new one)
fut[t] = std::async(std::launch::async, [t, &vec, &N, Th]{
    int nb = t * N / Th;
    int ne = (t + 1) * N / Th;
    for(int i = nb; i < ne; ++i)
    {
        vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
    }
});

Type B worked well. The result :

single (vectorized) : 44 ms.
single (not vectorized) : 77 ms.
--
Th = 1 (Type A) : 435 ms.
Th = 2 (Type A) : 278 ms.
Th = 4 (Type A) : 219 ms.
Th = 8 (Type A) : 212 ms.
--
Th = 1 (Type B) : 112 ms.
Th = 2 (Type B) : 74 ms.
Th = 4 (Type B) : 60 ms.
Th = 8 (Type B) : 61 ms.

The Type B result is understandable: the multi-threaded code runs faster than the single-threaded non-vectorized code, but not as fast as the vectorized code. Type A, on the other hand, looks equivalent to Type B (which only adds temporary variables), yet the two perform differently. Presumably the two types generate different assembly code.

Update 3: I may have found a factor that slowed down the multi-threaded for-loop: the division in the loop condition. Here is a single-threaded test:

//ver 1 (ordinary)
fut[t] = std::async(std::launch::async, [&vec, &N]{
    for(int i = 0; i < N; ++i)
    {
        vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
    }
});

//ver 2 (introducing a redundant variable Q)
int Q = 1;
fut[t] = std::async(std::launch::async, [&vec, &N, Q]{
    for(int i = 0; i < N / Q; ++i)
    {
        vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
    }
});

//ver 3 (using a temporary variable)
int Q = 1;
fut[t] = std::async(std::launch::async, [&vec, &N, Q]{
    int end = N / Q;
    for(int i = 0; i < end; ++i)
    {
        vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
    }
});

//ver 4 (using a raw value)
fut[t] = std::async(std::launch::async, [&vec]{
    for(int i = 0; i < 100000000; ++i)
    {
        vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
    }
});

And running time :

ver 1 : 132 ms.
ver 2 : 391 ms.
ver 3 : 47 ms.
ver 4 : 43 ms.

ver 3 and ver 4 were well optimized. ver 1 was optimized less, I think because the compiler could not treat N as invariant even though N is constexpr. I think ver 2 was very slow for the same reason: the compiler could not tell that N and Q would not change, so the condition i < N / Q presumably had to be re-evaluated, division included, on every iteration, which slowed down the for-loop.
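
A minimal sketch of the pattern that ver 3 and Type B point to: compute the chunk bounds once, outside the hot loop, and capture them by value so the loop condition compares against a plain local. The helper names scale_range and scale_parallel are hypothetical and not part of the original test, and the 64-bit bound arithmetic is an added precaution. Whether a particular compiler then vectorizes the loop is not guaranteed.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <future>
#include <vector>

// Hypothetical helper: applies the question's update to vec[begin, end).
// The bounds arrive as by-value parameters, so the loop condition never
// has to re-read or re-compute anything that another thread could change.
inline void scale_range(std::vector<int>& vec, std::size_t begin, std::size_t end)
{
    int* data = vec.data();
    for (std::size_t i = begin; i < end; ++i)
    {
        data[i] = 2 * data[i] + 3 * data[i] - data[i];
    }
}

// Hypothetical helper: splits vec into `parts` chunks, one std::async task each.
inline void scale_parallel(std::vector<int>& vec, int parts)
{
    assert(parts > 0);
    const std::uint64_t n = vec.size();
    const std::uint64_t p = static_cast<std::uint64_t>(parts);

    std::vector<std::future<void>> fut;
    fut.reserve(static_cast<std::size_t>(p));
    for (std::uint64_t t = 0; t < p; ++t)
    {
        // Bounds are computed once per task, in 64-bit arithmetic.
        const std::size_t nb = static_cast<std::size_t>(t * n / p);
        const std::size_t ne = static_cast<std::size_t>((t + 1) * n / p);
        fut.push_back(std::async(std::launch::async, [&vec, nb, ne] {
            scale_range(vec, nb, ne);
        }));
    }
    for (auto& f : fut)
    {
        f.get();
    }
}
```

Calling scale_parallel(vec, 4) would roughly correspond to the Th = 4 run of Type B.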

I think you're relying a bit too much on compiler optimizations for the multi-threaded test. — Mysticial, Feb 27 '16 at 05:41
My arbitrary guess is that the MT version doesn't get vectorized for some reason, maybe because the loop bounds aren't a compile-time constant and because of the extra indirection via the lambda. Try to compare your single-threaded code with a version exactly like the MT one, except that you are using the deferred launch policy — MikeMB, Feb 27 '16 at 05:43
Is this the exact code you are running? When I try it on MSVS and g++ the whole thing is optimized away and I am getting between 0 and 3 ms for everything. — NathanOliver, Feb 27 '16 at 05:46
If you want to gain speed by parallelizing loops, you should probably use libraries like PPL or TBB or even language extensions (OpenMP) instead of raw C++11 classes anyway — MikeMB, Feb 27 '16 at 05:47
Btw, if nobody has noticed it yet. The lambda captures `Th` by reference. Asking the compiler to prove that `Th` stays constant within each thread might be a long-shot. So you probably have an integer division in the loop condition that's throwing everything off. — Mysticial, Feb 27 '16 at 05:56
@Mysticial You're right. This test is insufficient, but the result of this simple process is so unexpected that it's difficult for me to make use of multithreading (with std::async) :( — m. bs, Feb 27 '16 at 06:12
@Mysticial The indication about `Th` is really right. It was lucky the program ran without problems. — m. bs, Feb 27 '16 at 06:19
@MikeMB I tried rewriting the for-loop as a lambda and compared. It did take 20~30 ms longer than the non-lambda version. But the multi-threaded version ran much slower than that, so this doesn't seem to be the major problem. — m. bs, Feb 27 '16 at 06:29
@MikeMB thanks! I'll consider using the libraries and extensions when needed. — m. bs, Feb 27 '16 at 06:34
@NathanOliver In the Release build, I used N = 100000000 instead of N = 100000, so you need to change it to get the same result. — m. bs, Feb 27 '16 at 06:38
Unless your tests show that using one separate thread has the same performance as using no additional thread (neglecting the startup/shutdown overhead), your test is broken. Fix that first, then look for other things concerning real parallelism. — Ulrich Eckhardt, Feb 27 '16 at 14:26
@m.bs note that `t*N/Th` rolls over. You can test with more threads if you do `chunk=N/Th; t*chunk`. int64 would also serve, obviously. — BitWhistler, Feb 27 '16 at 20:09
I think it doesn't roll over. With `N = 100000000` and t = 1 ~ 16, t*N never exceeds INT_MAX (2147483647). If it did, the program would hang. — m. bs, Feb 28 '16 at 01:15
No longer relevant right now, but what I meant was to use the same code (including `std::async`, partitioning logic, vector of lambdas etc) and just flip the launch policy from `async` to `deferred`. That way it is easier to distinguish the effects of compiler optimization vs. thread creation overhead and other MT problems, like cache thrashing etc. But you pretty much did that by now. Also it would have been interesting to see what the performance was when N is a power of two. — MikeMB, Feb 28 '16 at 07:20
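
The comparison MikeMB describes in the comments, identical code with only the launch policy flipped, could be sketched roughly as follows. The function name timed_run is hypothetical. It uses std::chrono::steady_clock rather than the system_clock of the original test, since a monotonic clock is the safer choice for measuring intervals. Any timings it returns depend on the machine and the build settings.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical harness: runs the same partitioned loop under the given launch
// policy and returns the elapsed wall time in milliseconds.
inline long long timed_run(std::vector<int>& vec, int parts, std::launch policy)
{
    assert(parts > 0);
    const std::size_t n = vec.size();
    const std::size_t p = static_cast<std::size_t>(parts);
    const auto start = std::chrono::steady_clock::now();

    std::vector<std::future<void>> fut;
    fut.reserve(p);
    for (std::size_t t = 0; t < p; ++t)
    {
        // Assumes t * n fits in std::size_t (true for the sizes in the question
        // on a 64-bit build).
        const std::size_t nb = t * n / p;
        const std::size_t ne = (t + 1) * n / p;
        fut.push_back(std::async(policy, [&vec, nb, ne] {
            for (std::size_t i = nb; i < ne; ++i)
            {
                vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
            }
        }));
    }
    // With std::launch::deferred each task runs here, inside get(), on the
    // calling thread; with std::launch::async get() only waits for it.
    for (auto& f : fut)
    {
        f.get();
    }

    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start).count();
}
```

Running it once with std::launch::deferred and once with std::launch::async on equal inputs separates the cost of the lambda and partitioning code from the cost of actually using threads.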

BitWhistler · Accepted Answer · 2016-02-27T21:46:54.577

1. When you run single threaded, your single thread has vec in the caches, as you've just created it from mt. And it'd keep streaming nicely through caches as it's the only user of all cache levels.
   I don't think much vectorization is going on here or you'd get shorter times. I could be wrong, though, as memory bandwidth is the key here. Did you look at the asm?
2. Any other threads would have to fetch ram. This in itself is not a big issue in your case as it's a single cpu so L3 is shared and the data set is larger than L3 anyway.
3. BUT, multiple threads fighting for L3 is bad. I think this is the main factor here.
4. You run too many threads. You should run as many threads as you have cores to pay less for context switching and cache littering.
5. HT is beneficial when the 2 hw threads have enough "holes" in pipelines (not the case here), BP (not the case here), and in cache utilization (strong case here -> see #1).
6. I'm actually surprised >2 threads didn't degrade much more --- nowadays cpus are amazing!
7. Thread launch and term times are less than predictable. If you want more predictability, run the threads constantly and use some cheap signalling to start them and notify they're done.
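
The last suggestion, keeping a thread alive and signalling it instead of launching a new one per measurement, might look roughly like the sketch below. The Worker class is hypothetical and deliberately minimal: one thread, one outstanding job at a time, and a mutex plus condition variable as the signalling mechanism, which is only one of several possible choices.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <thread>
#include <utility>

// Hypothetical long-lived worker: the thread is created once and woken up per
// job, so thread start-up cost stays outside the timed region.
class Worker
{
public:
    Worker() : thread_([this] { loop(); }) {}

    ~Worker()
    {
        {
            std::lock_guard<std::mutex> lock(m_);
            stop_ = true;
        }
        cv_.notify_all();
        thread_.join();
    }

    // Hands a job to the worker and returns immediately.
    void submit(std::function<void()> job)
    {
        {
            std::lock_guard<std::mutex> lock(m_);
            assert(done_ && "this sketch supports one job at a time");
            job_ = std::move(job);
            done_ = false;
        }
        cv_.notify_all();
    }

    // Blocks until the submitted job has finished.
    void wait()
    {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return done_; });
    }

private:
    void loop()
    {
        std::unique_lock<std::mutex> lock(m_);
        for (;;)
        {
            cv_.wait(lock, [this] { return stop_ || static_cast<bool>(job_); });
            if (!job_)
            {
                return; // woken up only to stop
            }
            std::function<void()> job = std::move(job_);
            job_ = nullptr;
            lock.unlock();
            job(); // run the job without holding the lock
            lock.lock();
            done_ = true;
            cv_.notify_all();
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::function<void()> job_;
    bool done_ = true;
    bool stop_ = false;
    std::thread thread_; // declared last so the other members exist before it starts
};
```

A benchmark would create one Worker per core before starting the clock, then time only the submit and wait calls.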

EDIT: Answers to specific questions

The process is just calculation of the vector, so what can be optimized not in the multi-thread codes but in the single-thread codes?

Not much code here to optimize.... You can break down the long loops to enable loop unrolling:

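constexpr int C = 16; // try other C values?
int i = nb;
for(; i + C <= ne; i += C) { // full blocks of C elements only
  for(int j = 0; j < C; j++)
    vec[i+j] = 2 * vec[i+j] + 3 * vec[i+j] - vec[i+j]; // that's === vec[i+j] <<= 2;
}
for(; i < ne; ++i) // the remainder
  vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];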

You can vectorize by hand if the compiler didn't. Look at the assembly first.
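
As an illustration of what vectorizing by hand could mean here, the sketch below uses SSE2 intrinsics to process four ints per instruction. It assumes an x86 or x86-64 target and 32-bit int. The function name scale_sse2 is hypothetical. It relies on 2*v + 3*v - v being 4*v, which equals v << 2 as long as the result does not overflow.

```cpp
#include <cassert>
#include <cstddef>
#include <emmintrin.h> // SSE2 intrinsics; assumes an x86 or x86-64 target
#include <vector>

static_assert(sizeof(int) == 4, "this sketch assumes 32-bit int");

// Hypothetical helper: multiplies every element by 4, four lanes at a time.
inline void scale_sse2(std::vector<int>& vec)
{
    int* data = vec.data();
    const std::size_t n = vec.size();
    assert(data != nullptr || n == 0);

    std::size_t i = 0;
    for (; i + 4 <= n; i += 4)
    {
        // Unaligned load/store, since std::vector gives no 16-byte guarantee.
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(data + i));
        v = _mm_slli_epi32(v, 2); // shift each 32-bit lane left by 2, i.e. times 4
        _mm_storeu_si128(reinterpret_cast<__m128i*>(data + i), v);
    }
    for (; i < n; ++i) // scalar remainder
    {
        data[i] = 4 * data[i];
    }
}
```

Since the loop is likely limited by memory bandwidth, as noted above, a hand-written version may not beat what the compiler already generates; measuring is the only way to tell.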

This program contains nothing about mutex or atomic etc, and data conflict might not occur. I think overheads around multithreading would be relatively small.

True. Except that threads may start in their own time. Especially on Windows and especially if there are many of them.

CPU utilization in the codes not using std::async is smaller than in the multi-threading codes. Is it efficient to use the large part of CPU?

You always want to use more cpu % for shorter time. I'm not sure what you are seeing, as there's no IO here.

I don't quite understand the first part of your answer. The multithreaded measurements start with only one thread and performance is worse by one order of magnitude compared to the non-MT answer. Also performance increases with two and even with four threads (despite the processor having only two "real" cores) and the performance doesn't even degrade with 8 threads. If the memory system was really the bottleneck, then both effects should not be observable. Just from looking at the code, I would have thought that it is memory bound too, but the measurements don't support this. — MikeMB, Feb 28 '16 at 07:08
@MikeMB: these numbers are because the for loop recalcs the end condition on every iteration and some thread cre/term noise. I pasted the code here and took me 2 minutes to change to fixed ne, like the OP did now, and got 50 odd ms constant at any Th up to 8. — BitWhistler, Feb 28 '16 at 14:52
