I'm building some formatted strings using std::ostringstream. When running on a single thread, code profiling shows no bottleneck caused by std::ostringstream.

When I start using more threads, std::ostringstream slows down due to std::__1::locale::locale. This gets worse and worse as more threads are used.

I'm not performing any thread synchronization explicitly, but I suspect something inside std::__1::locale::locale is causing my threads to block, and it gets worse as I use more threads. It's the difference between a single thread taking ~30 seconds and 10 threads taking 10 minutes.
The code in question is small but called many times:

template <typename T>
static std::string to_string(const T d) {
    std::ostringstream stream;
    stream << d;
    return stream.str();
}
When I change it to avoid constructing a new std::ostringstream every time,

thread_local static std::ostringstream stream;
const std::string clear;

template <typename T>
static std::string to_string(const T d) {
    stream.str(clear);  // reset the buffer contents; the stream object itself is reused
    stream << d;
    return stream.str();
}
I recover the multithreaded performance, but single-threaded performance suffers. What can I do to avoid this problem? The strings built here never need to be human readable; they are only used so that I can work around the lack of a hash function for std::complex. Is there a way to avoid localization when building formatted strings?
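One direction I've been considering, in case it helps frame the question: since the strings only serve as map keys, the real and imaginary parts could be formatted with std::to_chars from <charconv> (C++17), which never consults a locale. A rough, untested sketch (the helper name complex_key is mine, and it assumes libc++ provides the floating-point overloads of std::to_chars):

#include <charconv>
#include <complex>
#include <string>

// Build a map key from a std::complex<double> without touching the locale.
static std::string complex_key(const std::complex<double> &z) {
    char buffer[64];
    std::string key;
    const auto real_result = std::to_chars(buffer, buffer + sizeof(buffer), z.real());
    key.append(buffer, real_result.ptr);
    key.push_back(',');
    const auto imag_result = std::to_chars(buffer, buffer + sizeof(buffer), z.imag());
    key.append(buffer, imag_result.ptr);
    return key;
}

I haven't measured whether that fixes the scaling, and I'd still like to understand where the locale contention comes from. Here is a complete example that reproduces the behavior: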
#include <map>
#include <vector>
#include <sstream>
#include <complex>
#include <iostream>
#include <thread>
#include <chrono>

thread_local std::map<std::string, void *> cache;

int main(int argc, const char * argv[]) {
    for (size_t i = 1; i <= 10; i++) {
        const std::chrono::high_resolution_clock::time_point start =
            std::chrono::high_resolution_clock::now();

        std::vector<std::thread> threads(i);
        for (auto &t : threads) {
            t = std::thread([] () -> void {
                // Each thread formats one million complex numbers into strings
                // and uses them as keys in its thread_local map.
                for (size_t j = 0; j < 1000000; j++) {
                    std::ostringstream stream;
                    stream << std::complex<double> (static_cast<double> (j));
                    cache[stream.str()] = reinterpret_cast<void *> (&j);
                }
            });
        }
        for (auto &t : threads) {
            t.join();
        }

        const std::chrono::high_resolution_clock::time_point end =
            std::chrono::high_resolution_clock::now();
        const auto total_time = end - start;
        const std::chrono::nanoseconds total_time_ns =
            std::chrono::duration_cast<std::chrono::nanoseconds> (total_time);

        // Print the elapsed time in a human-friendly unit.
        if (total_time_ns.count() < 1000) {
            std::cout << total_time_ns.count() << " ns" << std::endl;
        } else if (total_time_ns.count() < 1000000) {
            std::cout << total_time_ns.count()/1000.0 << " μs" << std::endl;
        } else if (total_time_ns.count() < 1000000000) {
            std::cout << total_time_ns.count()/1000000.0 << " ms" << std::endl;
        } else if (total_time_ns.count() < 60000000000) {
            std::cout << total_time_ns.count()/1000000000.0 << " s" << std::endl;
        } else if (total_time_ns.count() < 3600000000000) {
            std::cout << total_time_ns.count()/60000000000.0 << " min" << std::endl;
        } else {
            std::cout << total_time_ns.count()/3600000000000.0 << " h" << std::endl;
        }
        std::cout << std::endl;
    }
    return 0;
}
Running on a 10-core (8 performance, 2 efficiency) Apple M1 produces the following output. Build settings are the standard Xcode defaults. For a Debug build the timings are
3.90096 s
4.15853 s
4.48616 s
4.843 s
6.15202 s
8.14986 s
10.6319 s
12.7732 s
16.7492 s
19.9288 s
For a Release build, the timings are
844.28 ms
1.23803 s
2.05088 s
3.39994 s
7.43743 s
9.53968 s
11.2953 s
12.6878 s
20.3917 s
24.1944 s