
I'm trying to get my head around a problem with multithreading in C++.

In summary, I have an all-to-all shortest-path problem on a graph. To solve it, I run N one-to-all queries (one for each node). To parallelize, I split the set of nodes among the available threads, and I try every thread count from 1 to 16.

What happens is that my computation time decreases up to 4 or 5 threads, then goes up again. I had narrowed the problem down to false sharing, but I'm not sure anymore.

Here are the relevant pieces of code:

auto t_start = std::chrono::high_resolution_clock::now();

size_t sz = all_nodes.size();
size_t np = config.n_threads;
size_t part = sz / np;

std::vector<std::thread> threads(np);

// Each thread solves the one-to-all problem for its chunk of nodes
auto paraTask =
    [&](size_t start, size_t end) {
        for (size_t l = start; l < end; ++l) {
            fun(config, {all_nodes[l]});
        }
    };

for (size_t i = 0; i < np; i++) {
    size_t start = i * part;
    // The last thread also takes the remainder of sz / np
    size_t length = (i + 1 == np) ? sz - i * part : part;
    threads[i] = std::thread(paraTask, start, start + length);
}

for (auto &&thread: threads) {
    thread.join();
}

double elapsed = getMs(t_start, std::chrono::high_resolution_clock::now());

Now, the function fun solves the problem for each node. To solve it (without going into too much detail), I need to keep track of timings at each node:

Given a node i, I want to find the shortest time to every other node. Therefore I keep a timings structure whose size equals the number of nodes (all_nodes.size()).

Here's the function fun

void fun(const Config &config, const vector<size_t> &all_nodes) const {
    // Marked stops
    unordered_set<size_t> marked_stops;
    unordered_set<size_t> new_marked_stops;

    size_t n_stops = stops.size();

    // One entry per stop, initialized to "infinity" (3 days in minutes)
    vector<int> tau(n_stops, 1440 * 3);

    for (size_t p: nearest_ps) {
        marked_stops.insert(p);
        tau[p] = config.start_time;
    }

    // Process each route
    for (size_t p: marked_stops) {
        for (size_t id_trip: stops[p].by_trips) { // all trips passing through stop p
            const auto &t = seq_trips[id_trip];
            for (size_t l = 0, l_max = t.size(); l < l_max; ++l) { // analyze a single trip
                // If we can take that trip from p (condition omitted):
                // mark all following stops if necessary
                for (; l < l_max; ++l) {
                    // If the following stop can be improved (condition omitted)
                    tau[t[l].stop_id] = t[l].arrival_time;
                    new_marked_stops.insert(t[l].stop_id);
                }
            }
        }
    }

    // Reset marked stops
    std::swap(marked_stops, new_marked_stops);
    new_marked_stops.clear();

    // ... (rest of the function omitted)
}
I removed the parts that are not of interest. Timing the code, it turns out that the time problem is related to the tau structure itself.

Now, with 1 thread I have one such structure in memory at a time, but with N threads I have N of them at the same time. I believe this is what causes the slowdown as the number of threads increases.

So here is my question: what is happening in memory? Are some of the threads forced to pause because their cached data gets evicted by other threads? And why is that?

My architecture is:

  • CPU - 11th Gen Intel(R) Core(TM) i7-11700KF @ 3.60GHz
  • RAM - 16 GB DDR4
  • OS - Windows 11
  • Compiler - MS Visual Studio 2022

DRAM hardware (from a CPU-Z report): [CPU-Z screenshot]

And std::thread::hardware_concurrency() returns 16.

EDIT 2: I added the function fun, how I measure time in the main function, and a report of the RAM hardware.

  • Yes you're right, I edited the question with the details – Claudio Tomasi Mar 29 '23 at 15:39
  • This processor has 16 threads and 8 cores. If `fun` is compute-bound, then this should scale. Unfortunately, the code of `fun` is not provided so we cannot tell. Please provide its code. Besides, please also show how you measure the timings and the timings themselves. Creating threads is quite slow and it can be a problem if your code takes less than 1 ms (platform-dependent). Finally, the memory configuration can impact the memory throughput, but no information is provided on the DRAM hardware (number of channels? size of each DIMM? etc.). Please also add this. – Jérôme Richard Mar 29 '23 at 18:03
  • Please consider providing a [Minimal, Reproducible Example](https://stackoverflow.com/help/minimal-reproducible-example) so we can reproduce the problem and better help you. – Jérôme Richard Mar 29 '23 at 18:04

1 Answer


The tau vector is likely not an issue here as long as it is small enough to fit in the L2 cache. Your processor has 512 KiB of L2 cache per core, and each core has its own dedicated L1/L2 cache. Thus, as long as the vector is smaller than 65536 items, it should not be an issue. Between 65536 and 131072 items, the number of cache misses can start to matter, but the effect should be quite small. Beyond that, the shared L3 cache is used and this can be a significant issue. Indeed, the L3 cache (LLC) is shared between all cores, so parallel random accesses in the L3 can evict cache lines that are useful to threads running on other cores. If a single tau is large enough to fill the L3 cache (or close to it), then it is normal for your application not to scale with many threads, since many random accesses are likely performed in a dataset that does not fit in the L3 cache at all. This is the price to pay for many workers operating in the same small room: they step on each other.
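
As a quick sanity check, you can compare sizeof(int) * n_stops against these cache sizes. A minimal sketch (n_stops is a hypothetical placeholder for stops.size(); the 16 MiB L3 figure is this CPU model's spec, double-check it on your machine):

#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t n_stops   = 200000;                  // hypothetical, replace with stops.size()
    const std::size_t tau_bytes = n_stops * sizeof(int);

    const std::size_t l2_per_core = 512u * 1024u;          // 512 KiB per core on the i7-11700KF
    const std::size_t l3_shared   = 16u * 1024u * 1024u;   // 16 MiB shared between all 8 cores

    std::printf("tau: %zu KiB\n", tau_bytes / 1024);
    std::printf("fits in half of L2: %s\n", tau_bytes <= l2_per_core / 2 ? "yes" : "no");
    // With N threads there are N copies of tau competing for the shared L3
    std::printf("8 copies fit in L3: %s\n", 8 * tau_bytes <= l3_shared ? "yes" : "no");
}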

Your CPU has 8 cores. Having 16 threads, that is, 2 threads per core (i.e. SMT, a.k.a. Hyper-Threading), makes this even worse since twice as much data per core needs to be stored in the cache. This likely causes more cache misses here, so having more than 8 threads may not be beneficial (unless the latency can be hidden by the SMT execution). Thus, I do not expect the application to scale beyond 8 threads.
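
If you want to test this, you can cap the number of worker threads at the number of physical cores. A minimal sketch (pick_thread_count is a hypothetical helper; std::thread::hardware_concurrency() reports logical threads, i.e. 16 here, so dividing by 2 is only a heuristic valid for 2-way SMT CPUs like this one):

#include <algorithm>
#include <cstddef>
#include <thread>

std::size_t pick_thread_count(std::size_t requested) {
    // hardware_concurrency() returns logical threads (16 on this CPU);
    // assume 2-way SMT to estimate the number of physical cores (8 here)
    const std::size_t logical  = std::max(1u, std::thread::hardware_concurrency());
    const std::size_t physical = std::max<std::size_t>(1, logical / 2);
    return std::min(requested, physical);
}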

If the vector is huge and has a power-of-two size, then you can run into a critical cache-thrashing issue. Consider reading this post if you are in this case.
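
Should you hit that case, a commonly suggested workaround (a general trick, not something specific to your code, and whether it helps depends on the access pattern) is to pad the allocation so its size is not an exact power of two. A minimal sketch with a hypothetical padded_size helper:

#include <cstddef>

// Hypothetical helper: if n is an exact power of two, grow it by one
// cache line worth of ints so the allocation is no longer a power of two.
std::size_t padded_size(std::size_t n) {
    const bool is_pow2 = n != 0 && (n & (n - 1)) == 0;
    return is_pow2 ? n + 64 / sizeof(int) : n;
}

// Usage (in fun): vector<int> tau(padded_size(n_stops), 1440 * 3);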

If your vector is really huge, does not even fit in the L3 cache, and the accesses are uniformly spread, then it might be better to store only 1 integer per cache line (a 16x bigger tau) and write whole cache lines directly. This can be done using non-temporal SSE intrinsics. This solution has a much bigger memory footprint and can also cause scalability issues due to the higher amount of data transferred to memory (though much less is read back from RAM, since the write-allocate policy is bypassed). If the accesses follow an exponential-like distribution, then this method is not a good idea: in that case most of the accesses should hit in the cache with the current implementation, and bypassing the caches (mostly) prevents those hits.
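
To make the idea concrete, here is a minimal sketch of the non-temporal-store approach (padded_tau and the sizes are hypothetical; _mm_stream_si32 is the SSE2 intrinsic that writes a 32-bit integer while bypassing the caches):

#include <cstddef>
#include <emmintrin.h>   // SSE2: _mm_stream_si32, plus _mm_malloc/_mm_free/_mm_sfence

constexpr std::size_t CACHE_LINE    = 64;
constexpr std::size_t INTS_PER_LINE = CACHE_LINE / sizeof(int);   // 16

int main() {
    const std::size_t n_stops = 1 << 20;   // hypothetical size

    // One int per cache line: 16x bigger than a plain vector<int>
    int *padded_tau = static_cast<int *>(
        _mm_malloc(n_stops * CACHE_LINE, CACHE_LINE));

    // Non-temporal store: the written line bypasses the caches,
    // so it does not evict data useful to other threads.
    std::size_t stop_id = 42;              // hypothetical stop index
    _mm_stream_si32(&padded_tau[stop_id * INTS_PER_LINE], 480);

    _mm_sfence();                          // make the streaming stores globally visible
    _mm_free(padded_tau);
}

Note that later reads of such a padded tau still have to come from memory, which is why this only pays off when accesses are uniformly spread and mostly miss anyway.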

In fact, the memory throughput might already be an issue in your case. If tau is so big that the N threads, each having its own tau, cannot keep them all in the L3 cache, then the resulting cache misses cause data to be fetched from main memory. If memory accesses are uniform, only 4 bytes out of 64 (the size of a cache line on x86 CPUs) are actually used. The cache line needs to be read from RAM so it can be modified and then written back, due to the way caches work on your target CPU. This means 128 bytes are transferred for just 4 useful bytes. Having more threads helps to hide the latency of these expensive cache misses, but the throughput of your RAM is bounded, so this can limit the scalability of the application.

Actually, the theoretical bandwidth of your RAM is 1333e6*2*8/1024**3 = 19.9 GiB/s. Modern processors often barely manage to reach more than 80% of this bandwidth in practice when writing data (or mixing reads and writes). It is even more difficult for them to use the RAM efficiently when accesses are random, because of DDR4 DRAM banks (despite the name RAM meaning "random access memory"). With more threads, accesses tend to look more random to the integrated memory controller, making things worse. AFAIK, Intel CPUs prefetch 2 cache lines to mitigate the latency overhead at the expense of a higher throughput. Thus, I do not expect your processor to reach more than 40% of the RAM bandwidth (optimistic situation), that is, about 8 GiB/s. This means about 1 GiB/s per core, which is pretty low; in fact, it is low enough to be a major bottleneck. That being said, I made several assumptions that may not hold (e.g. uniform accesses), so I advise you to check these hypotheses using low-level profilers. You can use VTune on Windows to analyse the memory throughput of your application. VTune can also report other useful low-level information, like the number of L1/L2/L3 cache hits/misses.
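
Here are the same back-of-the-envelope numbers written out; the DDR4-2666 figures and the 40% efficiency factor are the assumptions stated above:

#include <cstdio>

int main() {
    // DDR4-2666: 1333 MHz I/O clock, double data rate, 8-byte bus width
    const double peak_gib_s = 1333e6 * 2 * 8 / (1024.0 * 1024.0 * 1024.0);   // ~19.9 GiB/s
    const double reachable  = peak_gib_s * 0.40;                             // ~8 GiB/s (assumed 40%)
    const double per_core   = reachable / 8.0;                               // ~1 GiB/s per core

    // 128 bytes moved (64 read + 64 written back) per 4 useful bytes
    const double useful_fraction = 4.0 / 128.0;

    std::printf("peak: %.1f GiB/s, reachable: %.1f, per core: %.2f\n",
                peak_gib_s, reachable, per_core);
    std::printf("useful fraction of traffic: %.1f %%\n", 100.0 * useful_fraction);
}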

The accesses to new_marked_stops are likely to be the main bottleneck. Indeed, std::unordered_set<size_t> is typically implemented as an array of buckets, each holding a linked list; this is called separate chaining. Such an implementation allocates a new node in the linked list whenever a new entry is inserted (i.e. one not already present in the set). The node locations become quite random once a program has been running for a while, causing memory diffusion. This effect tends to cause more cache misses, especially when the data structure is big. When the number of items gets large, the hash set needs to be resized, which is an expensive process causing a lot of cache misses. Such cache misses can result in a cache-thrashing effect that reduces the scalability of the application.

Even worse: allocations/frees tend not to scale on most platforms (AFAIK, especially Windows). Indeed, the atomic operations and locks performed inside new/delete/malloc/free are serialized, so doing a lot of allocations makes the execution rather sequential. Atomic/lock contention can even make the resulting execution slower (typically due to a cache-line bouncing effect and the high latency of the L3 cache).

Assuming this is the bottleneck, one simple solution is not to use this data structure. Hash-map (or, similarly, hash-set) data structures using open addressing do not suffer from this issue. The C++ standard-library implementations generally do not use open addressing because it is very difficult (if possible at all) to do efficiently under the current C++ specification constraints. That being said, you can use an external library for that. The ones from Tessil are a good start, and the ones from Martin Leitner-Ankerl are also a good alternative. They claim to provide very high performance compared to the standard library (and many benchmarks prove them right so far). They should also make no allocations unless the hash set is resized, and you can preallocate one with a reasonable size to avoid allocations while it stays small. Be careful though: a bigger hash set tends to cause more cache misses. As long as it fits in the L1/L2 cache, it should be fine.
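
For instance, a minimal sketch using Tessil's header-only robin-map library (the set type is tsl::robin_set, and the reserve call is what avoids per-insert allocations; the 4096 capacity is an arbitrary example):

#include <cstddef>
#include <tsl/robin_set.h>   // https://github.com/Tessil/robin-map

int main() {
    // Open-addressing set: entries live in one flat array, no per-node allocation
    tsl::robin_set<std::size_t> new_marked_stops;

    // Preallocate a reasonable capacity so inserts do not trigger rehashing
    new_marked_stops.reserve(4096);

    new_marked_stops.insert(42);
    new_marked_stops.insert(42);                     // duplicate, ignored like std::unordered_set
    bool marked = new_marked_stops.count(42) > 0;
    (void)marked;
}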

If the hash set tends to be big in your case and most items are already in it, then you can use a Bloom filter to speed the computation up. This probabilistic data structure is much more compact, so it can significantly reduce the number of cache misses, resulting in a more scalable execution.
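
In case it helps, here is a tiny (non-production) sketch of the data structure itself; the bit-array size and the two hash functions are arbitrary choices, and how to wire it into the marking loop depends on whether occasional false positives are acceptable for your algorithm:

#include <bitset>
#include <cassert>
#include <cstddef>
#include <functional>

// Minimal Bloom filter over size_t keys.
struct TinyBloom {
    static constexpr std::size_t N = 1 << 16;   // 64 Kbit = 8 KiB, easily cache-resident
    std::bitset<N> bits;

    static std::size_t h1(std::size_t x) { return std::hash<std::size_t>{}(x) % N; }
    static std::size_t h2(std::size_t x) { return std::hash<std::size_t>{}(x ^ 0x9e3779b97f4a7c15ull) % N; }

    void insert(std::size_t x) { bits.set(h1(x)); bits.set(h2(x)); }
    // false -> definitely never inserted; true -> probably inserted (false positives possible)
    bool may_contain(std::size_t x) const { return bits.test(h1(x)) && bits.test(h2(x)); }
};

int main() {
    TinyBloom marked;
    marked.insert(42);
    assert(marked.may_contain(42));   // always true for inserted keys
    // may_contain(7) is most likely false, but can occasionally be a false positive
}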

Jérôme Richard
  • Thanks for the long answer! I'll try to remove the operations on new_marked_stops and see what happens to the time. Does this unordered set suffer from the same things you mentioned earlier when using more threads (cache and so on)? – Claudio Tomasi Apr 02 '23 at 12:37
  • Yes, unordered sets do random accesses in memory, so they are very sensitive to the cache hierarchy (especially due to hashing and separate chaining). – Jérôme Richard Apr 02 '23 at 12:42
  • I see! Just one last point: do you know any references explaining the interplay between threads and memory, caches, and all this stuff, to understand them better in the general case and not only in this specific one? Thank you again – Claudio Tomasi Apr 02 '23 at 18:20
  • Books are certainly a good way to understand the basics of such topics. I personally recommend the free book by Victor Eijkhout: [Introduction to High-Performance Scientific Computing](https://web.corral.tacc.utexas.edu/CompEdu/pdf/stc/EijkhoutIntroToHPC.pdf) (only part 1 is relevant for your needs). After that, the Wikipedia article on CPU caches can also be interesting: https://en.wikipedia.org/wiki/CPU_cache. – Jérôme Richard Apr 02 '23 at 23:04
  • OK, I tried to cancel the operation with new_marked_stops, and the same thing happens when increasing the number of threads – Claudio Tomasi Apr 03 '23 at 10:25
  • What do you mean by "cancel the operation with new_marked_stops"? – Jérôme Richard Apr 04 '23 at 17:11
  • Commenting out the part where we add new marked stops; I only keep the initial marked_stops – Claudio Tomasi Apr 05 '23 at 08:22
  • OK. I guess it means the allocations were not the bottleneck in practice, possibly because most of the values are already appended quickly. Thus, the cache effects are likely the source of the slowdown. – Jérôme Richard Apr 05 '23 at 21:17
  • So we are back to the cache misses due to the larger structure (tau) – Claudio Tomasi Apr 07 '23 at 11:48