I'm building some formatted strings using std::ostringstream. When running on a single thread, code profiling shows no bottleneck caused by std::ostringstream.

When I start using more threads, std::ostringstream slows down due to std::__1::locale::locale. This gets worse and worse as more threads are used.

I'm not performing any thread synchronization explicitly, but I suspect something inside std::__1::locale::locale is causing my threads to block, and it gets worse as I use more threads. It's the difference between a single thread taking ~30 seconds and 10 threads taking 10 minutes.
The code in question is small but called many times:

template <typename T>
static std::string to_string(const T d) {
    std::ostringstream stream;
    stream << d;
    return stream.str();
}
When I change it to avoid constructing a new std::ostringstream every time,

thread_local static std::ostringstream stream;
const std::string clear;

template <typename T>
static std::string to_string(const T d) {
    stream.str(clear);  // reset the buffer contents; the stream object itself is reused
    stream << d;
    return stream.str();
}
I recover the multithreaded performance, but single-threaded performance suffers. What can I do to avoid this problem? The strings built here never need to be human readable; they are only used so that I can work around the lack of a hash function for std::complex. Is there a way to avoid localization when building formatted strings?
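One direction I've been considering, in case it helps frame the question: since the strings only serve as map keys, the real and imaginary parts could be formatted with std::to_chars from <charconv> (C++17), which never consults a locale. A rough, untested sketch (the helper name complex_key is mine, and it assumes libc++ provides the floating-point overloads of std::to_chars):

#include <charconv>
#include <complex>
#include <string>

// Build a map key from a std::complex<double> without touching the locale.
static std::string complex_key(const std::complex<double> &z) {
    char buffer[64];
    std::string key;
    const auto real_result = std::to_chars(buffer, buffer + sizeof(buffer), z.real());
    key.append(buffer, real_result.ptr);
    key.push_back(',');
    const auto imag_result = std::to_chars(buffer, buffer + sizeof(buffer), z.imag());
    key.append(buffer, imag_result.ptr);
    return key;
}

I haven't measured whether that fixes the scaling, and I'd still like to understand where the locale contention comes from. Here is a complete example that reproduces the behavior: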
#include <map>
#include <vector>
#include <sstream>
#include <complex>
#include <iostream>
#include <thread>
#include <chrono>

thread_local std::map<std::string, void *> cache;

int main(int argc, const char * argv[]) {
    for (size_t i = 1; i <= 10; i++) {
        const std::chrono::high_resolution_clock::time_point start =
            std::chrono::high_resolution_clock::now();

        std::vector<std::thread> threads(i);
        for (auto &t : threads) {
            t = std::thread([] () -> void {
                // Each thread formats one million complex numbers into strings
                // and uses them as keys in its thread_local map.
                for (size_t j = 0; j < 1000000; j++) {
                    std::ostringstream stream;
                    stream << std::complex<double> (static_cast<double> (j));
                    cache[stream.str()] = reinterpret_cast<void *> (&j);
                }
            });
        }
        for (auto &t : threads) {
            t.join();
        }

        const std::chrono::high_resolution_clock::time_point end =
            std::chrono::high_resolution_clock::now();
        const auto total_time = end - start;
        const std::chrono::nanoseconds total_time_ns =
            std::chrono::duration_cast<std::chrono::nanoseconds> (total_time);

        // Print the elapsed time in a human-friendly unit.
        if (total_time_ns.count() < 1000) {
            std::cout << total_time_ns.count() << " ns" << std::endl;
        } else if (total_time_ns.count() < 1000000) {
            std::cout << total_time_ns.count()/1000.0 << " μs" << std::endl;
        } else if (total_time_ns.count() < 1000000000) {
            std::cout << total_time_ns.count()/1000000.0 << " ms" << std::endl;
        } else if (total_time_ns.count() < 60000000000) {
            std::cout << total_time_ns.count()/1000000000.0 << " s" << std::endl;
        } else if (total_time_ns.count() < 3600000000000) {
            std::cout << total_time_ns.count()/60000000000.0 << " min" << std::endl;
        } else {
            std::cout << total_time_ns.count()/3600000000000.0 << " h" << std::endl;
        }
        std::cout << std::endl;
    }
    return 0;
}
Running on a 10-core (8 performance, 2 efficiency) Apple M1 produces the following output. Build settings are the standard Xcode defaults. For a Debug build the timings are
3.90096 s
4.15853 s
4.48616 s
4.843 s
6.15202 s
8.14986 s
10.6319 s
12.7732 s
16.7492 s
19.9288 s
For a Release build, the timings are
844.28 ms
1.23803 s
2.05088 s
3.39994 s
7.43743 s
9.53968 s
11.2953 s
12.6878 s
20.3917 s
24.1944 s