Even without contention, std::mutex seems to scale horribly. In the test below, every thread is guaranteed to use its own mutex, yet total throughput does not grow with the number of threads. What is going on?
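#include <atomic>
#include <chrono>
#include <cstdio>
#include <mutex>
#include <numeric>
#include <thread>
#include <vector>

// Each thread locks and unlocks its own private mutex until told to stop,
// counting iterations through pResult.
void TestThread(std::atomic<bool>* pbFinished, int* pResult)
{
    std::mutex mtx;
    for (; !*pbFinished; (*pResult)++)
    {
        mtx.lock();
        mtx.unlock();
    }
}

// Runs coreCnt threads for ms milliseconds and prints the combined throughput.
void Test(int coreCnt)
{
    const int ms = 3000;
    std::atomic<bool> bFinished(false);   // atomic: written by main, read by workers
    std::vector<int> results(coreCnt);    // one counter per thread, stored contiguously
    std::vector<std::thread> threads;     // owned by value, nothing to delete
    for (int i = 0; i < coreCnt; i++)
        threads.emplace_back(TestThread, &bFinished, &results[i]);
    std::this_thread::sleep_for(std::chrono::milliseconds(ms));
    bFinished = true;
    for (std::thread& thread : threads)
        thread.join();
    int sum = std::accumulate(results.begin(), results.end(), 0);
    // sum / ms / 1000 = millions of lock/unlock pairs per second, all threads combined
    printf("%d cores: %.03fm ops/sec\n", coreCnt, double(sum)/double(ms)/1000.);
}

int main(int argc, char** argv)
{
    for (int cores = 1; cores <= (int)std::thread::hardware_concurrency(); cores++)
        Test(cores);
    return 0;
}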
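Results on Windows are very bad. The figures are millions of lock/unlock pairs per second summed over all threads, and they stay noisy and roughly flat as threads are added: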
1 cores: 15.696m ops/sec
2 cores: 12.294m ops/sec
3 cores: 17.134m ops/sec
4 cores: 9.680m ops/sec
5 cores: 13.012m ops/sec
6 cores: 21.142m ops/sec
7 cores: 18.936m ops/sec
8 cores: 18.507m ops/sec
Linux fares even worse. Total throughput drops by about two thirds as soon as a second thread is added and never recovers:
1 cores: 46.525m ops/sec
2 cores: 15.089m ops/sec
3 cores: 15.105m ops/sec
4 cores: 14.822m ops/sec
5 cores: 14.519m ops/sec
6 cores: 14.544m ops/sec
7 cores: 13.996m ops/sec
8 cores: 13.869m ops/sec
I have also tried TBB's reader/writer lock, and even rolled my own.
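
One variable worth isolating before blaming the mutex: the per-thread counters live next to each other in the results vector, so several of them may sit on the same cache line and every increment could be invalidating that line for the other cores. The sketch below is a variant of the test that gives each counter its own line. It is a guess at the cause, not a confirmed diagnosis. PaddedCounter, PaddedTestThread, RunPadded and the 64-byte line size are assumptions made up for illustration, and C++17 is assumed so that std::vector honors the over-alignment.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <mutex>
#include <thread>
#include <vector>

// Assumption: 64 bytes is the cache-line size on the machines under test.
constexpr std::size_t kAssumedCacheLine = 64;

// Hypothetical padded counter: alignas rounds the struct up to a full line,
// so neighbouring vector elements never share one.
struct alignas(kAssumedCacheLine) PaddedCounter {
    long long value = 0;
};
static_assert(sizeof(PaddedCounter) == kAssumedCacheLine,
              "each counter should occupy exactly one assumed cache line");

// Same loop as TestThread: a private mutex, locked and unlocked until stopped.
void PaddedTestThread(const std::atomic<bool>& finished, PaddedCounter& out)
{
    std::mutex mtx;
    while (!finished.load(std::memory_order_relaxed)) {
        mtx.lock();
        mtx.unlock();
        ++out.value;   // only this thread writes here until it is joined
    }
}

// Runs threadCnt workers for about ms milliseconds and returns the total
// number of lock/unlock pairs completed by all of them.
long long RunPadded(int threadCnt, int ms)
{
    assert(threadCnt > 0);
    std::atomic<bool> finished{false};
    std::vector<PaddedCounter> results(threadCnt);
    std::vector<std::thread> threads;
    threads.reserve(threadCnt);
    for (int i = 0; i < threadCnt; i++)
        threads.emplace_back(PaddedTestThread, std::cref(finished), std::ref(results[i]));
    std::this_thread::sleep_for(std::chrono::milliseconds(ms));
    finished.store(true);
    for (std::thread& t : threads)
        t.join();
    long long sum = 0;
    for (const PaddedCounter& r : results)
        sum += r.value;   // safe to read: every writer has been joined
    return sum;
}
```

Calling RunPadded(cores, 3000) in place of Test(cores) and dividing the result by 3000 * 1000.0 gives a figure in the same unit as the output above. If the padded numbers grow with the thread count, the counters were the bottleneck rather than std::mutex; if they stay flat, the cause lies elsewhere.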