I assumed that simple but heavy work (e.g. matrix calculations) would run faster with multiple threads than with a single thread, so I tested the following code:
#include <chrono>
#include <future>
#include <iostream>
#include <random>
#include <vector>

int main()
{
constexpr int N = 100000;
std::random_device rd;
std::mt19937 mt(rd());
std::uniform_real_distribution<double> ini(0.0, 10.0);
// single-thread
{
std::vector<int> vec(N);
for(int i = 0; i < N; ++i)
{
vec[i] = ini(mt);
}
auto start = std::chrono::system_clock::now();
for(int i = 0; i < N; ++i)
{
vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
}
auto end = std::chrono::system_clock::now();
auto dur = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
std::cout << "single : " << dur << " ms."<< std::endl;
}
// multi-threading (Th is the number of threads)
for(int Th : {1, 2, 4, 8, 16})
{
std::vector<int> vec(N);
for(int i = 0; i < N; ++i)
{
vec[i] = ini(mt);
}
auto start = std::chrono::system_clock::now();
std::vector<std::future<void>> fut(Th);
for(int t = 0; t < Th; ++t)
{
fut[t] = std::async(std::launch::async, [t, &vec, &N, &Th]{
for(int i = t*N / Th; i < (t + 1)*N / Th; ++i)
{
vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
}
});
}
for(int t = 0; t < Th; ++t)
{
fut[t].get();
}
auto end = std::chrono::system_clock::now();
auto dur = std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
std::cout << "Th = " << Th << " : " << dur << " ms." << std::endl;
}
return 0;
}
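For interval measurements like the ones above, std::chrono::steady_clock is generally a safer choice than system_clock, because the system clock can be adjusted while the program runs. A minimal sketch of a timing helper follows; the name time_ms is hypothetical and not part of the original program:

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (an assumption, not from the original code):
// runs `work` once and returns the elapsed wall time in milliseconds,
// measured with the monotonic steady_clock.
template <class F>
long long time_ms(F&& work)
{
    auto start = std::chrono::steady_clock::now();
    work();
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::milliseconds>(end - start).count();
}
```

Each timed section could then be written as `auto dur = time_ms([&]{ /* loop to measure */ });`, which keeps the start/end bookkeeping in one place.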
The execution environment :
OS : Windows 10 64-bit
Build-system : Visual Studio Community 2015
CPU : Core i5 4210U
When I built this program in Debug mode, the result was as I expected :
single : 146 ms.
Th = 1 : 140 ms.
Th = 2 : 71 ms.
Th = 4 : 64 ms.
Th = 8 : 61 ms.
Th = 16 : 68 ms.
This shows that the code without std::async performs about the same as the version using one thread, and that using 4 or 8 threads gives a clear speedup.
However, in Release mode I got a different result (with N increased from 100000 to 100000000) :
single : 54 ms.
Th = 1 : 443 ms.
Th = 2 : 285 ms.
Th = 4 : 205 ms.
Th = 8 : 206 ms.
Th = 16 : 221 ms.
This result puzzles me. Within the second half of the code, using more threads does perform better than using one. But the fastest of all is the first half, which does not use std::async. I know that optimization and the overhead of multithreading strongly affect performance. However,
- The work is just arithmetic on a vector, so what can be optimized in the single-threaded code but not in the multi-threaded code?
- This program uses no mutexes or atomics, and no data races should occur, so I expected the overhead of multithreading to be relatively small.
- CPU utilization in the code without std::async is lower than in the multi-threaded code. Is using a large part of the CPU actually efficient?
Update : I looked into vectorization. I enabled the /Qvec-report:1 option and found the following:
//vectorized (when N is large)
for(int i = 0; i < N; ++i)
{
vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
}
//not vectorized
auto lambda = [&vec, &N]{
for(int i = 0; i < N; ++i)
{
vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
}
};
lambda();
//not vectorized
std::vector<std::future<void>> fut(Th);
for(int t = 0; t < Th; ++t)
{
fut[t] = std::async(std::launch::async, [t, &vec, &N, Th]{
for(int i = t*N / Th; i < (t + 1)*N / Th; ++i)
{
vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
}
});
}
and run time :
single (with vectorization) : 47 ms.
single (without vectorization) : 70 ms.
The for-loop was indeed not vectorized in the multi-threaded version. However, the lack of vectorization alone does not account for the gap, so the multi-threaded version must be slow for other reasons as well.
Update 2 : I rewrote for-loop in the lambda (Type A to Type B) :
//Type A (the previous one)
fut[t] = std::async(std::launch::async, [t, &vec, &N, Th]{
for(int i = t*N / Th; i < (t + 1)*N / Th; ++i)
{
vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
}
});
//Type B (the new one)
fut[t] = std::async(std::launch::async, [t, &vec, &N, Th]{
int nb = t * N / Th;
int ne = (t + 1) * N / Th;
for(int i = nb; i < ne; ++i)
{
vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
}
});
Type B worked well. The result :
single (vectorized) : 44 ms.
single (not vectorized) : 77 ms.
--
Th = 1 (Type A) : 435 ms.
Th = 2 (Type A) : 278 ms.
Th = 4 (Type A) : 219 ms.
Th = 8 (Type A) : 212 ms.
--
Th = 1 (Type B) : 112 ms.
Th = 2 (Type B) : 74 ms.
Th = 4 (Type B) : 60 ms.
Th = 8 (Type B) : 61 ms.
The result of Type B is understandable (the multi-threaded code runs faster than the single-threaded non-vectorized code, but not as fast as the vectorized code). On the other hand, Type A looks equivalent to Type B (which only adds temporary variables), yet the two perform very differently. Presumably the two types generate different assembly code.
Update 3 : I may have found a factor that slowed down the multi-threaded for-loop. It is the division in the condition of the for statement. This is a single-threaded test :
//ver 1 (ordinary)
fut[t] = std::async(std::launch::async, [&vec, &N]{
for(int i = 0; i < N; ++i)
{
vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
}
});
//ver 2 (introducing a dummy variable Q)
int Q = 1;
fut[t] = std::async(std::launch::async, [&vec, &N, Q]{
for(int i = 0; i < N / Q; ++i)
{
vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
}
});
//ver 3 (using a temporary variable)
int Q = 1;
fut[t] = std::async(std::launch::async, [&vec, &N, Q]{
int end = N / Q;
for(int i = 0; i < end; ++i)
{
vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
}
});
//ver 4 (using a raw value)
fut[t] = std::async(std::launch::async, [&vec]{
for(int i = 0; i < 100000000; ++i)
{
vec[i] = 2 * vec[i] + 3 * vec[i] - vec[i];
}
});
And running time :
ver 1 : 132 ms.
ver 2 : 391 ms.
ver 3 : 47 ms.
ver 4 : 43 ms.
ver 3 & 4 were well optimized. ver 1 was less so, I think because the compiler could not treat N as invariant even though N was constexpr. I think ver 2 was very slow for the same reason: the compiler did not recognize that N and Q would not change, so the condition i < N / Q had to be re-evaluated with expensive instructions on every iteration, which slowed down the for-loop.
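The observation above (bounds held in plain local values run much faster than bounds recomputed from captured references) can be folded into a small helper. This is only a sketch under assumptions: the name scale_in_chunks and the use of a raw data pointer are hypothetical choices, not part of the original program, and whether the loop gets vectorized still depends on the compiler and its settings.

```cpp
#include <cassert>
#include <cstddef>
#include <future>
#include <vector>

// Hypothetical helper (an assumption, not from the original post):
// applies the same update to every element of `vec`, splitting the
// index range into `threads` contiguous chunks.
void scale_in_chunks(std::vector<int>& vec, int threads)
{
    assert(threads > 0);
    const std::size_t n = vec.size();
    int* const data = vec.data();

    std::vector<std::future<void>> fut;
    fut.reserve(threads);
    for (int t = 0; t < threads; ++t)
    {
        // Compute the chunk bounds once, outside the hot loop, and
        // capture them by value so the loop condition is a simple
        // comparison against a local constant (as in Type B / ver 3).
        const std::size_t begin = n * t / threads;
        const std::size_t end = n * (t + 1) / threads;
        fut.push_back(std::async(std::launch::async, [data, begin, end]{
            for (std::size_t i = begin; i < end; ++i)
            {
                data[i] = 2 * data[i] + 3 * data[i] - data[i];
            }
        }));
    }
    // Wait for every chunk to finish before returning.
    for (auto& f : fut)
    {
        f.get();
    }
}
```

Calling `scale_in_chunks(vec, 4)` would correspond to the Th = 4 case. Capturing the bounds and the pointer by value, rather than capturing N and Th by reference, leaves nothing in the loop condition that the compiler has to reload or re-divide on each iteration.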