I'm trying to get my head around a multithreading problem in C++.
In summary, I have an all-to-all shortest-path problem on a graph. To solve it I run N one-to-all queries, one for each node. To parallelize, I divide the set of nodes between the available threads, and I try this with 1 to 16 threads.
What happens is that my computational time decreases up to 4 or 5 threads, then it goes up again. I had narrowed the problem down to false sharing, but I'm not sure anymore.
Here are the relevant pieces of code:
auto start = std::chrono::high_resolution_clock::now();
size_t sz = all_nodes.size();
size_t np = config.n_threads;
size_t part = sz / np;
std::vector<std::thread> threads(np); // one thread per chunk of nodes
auto paraTask =
    [&](size_t start, size_t end) {
        // Run the one-to-all query for every node in this chunk
        for (size_t l = start; l < end; ++l) {
            fun(config, {all_nodes[l]});
        }
    };
for (size_t i = 0; i < np; i++) {
    size_t start = i * part;
    size_t length = (i + 1 == np) ? sz - i * part : part;
    threads[i] = std::thread(paraTask, start, start + length);
}
for (auto &&thread: threads) {
    thread.join();
}
double elapsed = getMs(start, std::chrono::high_resolution_clock::now());
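For what it's worth, this is the kind of per-thread timing I could add to see whether each individual thread actually gets slower as more threads are added (a minimal, self-contained sketch: the workload, the sizes and the printout are placeholders, not my real code):

#include <chrono>
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const std::size_t sz = 100000;            // placeholder for all_nodes.size()
    const std::size_t np = 8;                 // placeholder for config.n_threads
    const std::size_t part = sz / np;

    std::vector<double> thread_ms(np, 0.0);   // wall time per thread, in ms
    std::vector<std::thread> threads(np);

    for (std::size_t i = 0; i < np; ++i) {
        const std::size_t begin = i * part;
        const std::size_t end = (i + 1 == np) ? sz : begin + part;
        threads[i] = std::thread([&, i, begin, end] {
            const auto t0 = std::chrono::high_resolution_clock::now();
            volatile double dummy = 0.0;                  // stand-in workload
            for (std::size_t l = begin; l < end; ++l)
                dummy = dummy + static_cast<double>(l);   // would be fun(config, {all_nodes[l]})
            const auto t1 = std::chrono::high_resolution_clock::now();
            thread_ms[i] = std::chrono::duration<double, std::milli>(t1 - t0).count();
        });
    }
    for (auto &t : threads) t.join();

    for (std::size_t i = 0; i < np; ++i)
        std::printf("thread %zu: %.2f ms\n", i, thread_ms[i]);
    return 0;
}

If every thread's own time grows as np grows, the threads are clearly slowing each other down rather than just finishing at different times.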
Now, the function fun solves the problem for each node it is given. To do so (without going into too much detail) I need to keep track of the timings at each node: given a node i, I want to find the shortest time to every other node. Therefore I keep a structure of timings (tau below) whose size equals the number of nodes.
Here's the function fun:
void fun(const Config &config, const vector<size_t> &all_nodes) const {
    // Marked stops
    unordered_set<size_t> marked_stops;
    unordered_set<size_t> new_marked_stops;
    size_t n_stops = stops.size();
    vector<int> tau(n_stops, 1440 * 3); // earliest known time per stop, initialised to a large sentinel
    for (size_t p: nearest_ps) {
        marked_stops.insert(p);
        tau[p] = config.start_time;
    }
    // Process each route
    for (size_t p: marked_stops) {
        for (size_t id_trip: stops[p].by_trips) { // all trips passing through stop p
            const auto &t = seq_trips[id_trip];
            for (size_t l = 0, l_max = t.size(); l < l_max; ++l) { // analyze a single trip
                // If we can take that route from p: mark all following stops if necessary
                for (; l < l_max; ++l) {
                    // If the following stop can be improved
                    tau[t[l].stop_id] = t[l].arrival_time;
                    new_marked_stops.insert(t[l].stop_id);
                }
            }
        }
    }
    // Reset marked stops
    std::swap(marked_stops, new_marked_stops);
    new_marked_stops.clear();
    // ... (rest of fun omitted)
}
I removed the parts that are not of interest. Timing the code, it turns out that the bottleneck is related to the structure tau itself.
Now, with 1 thread I have one such structure in memory at a time, but with N threads I have N of them alive at the same time. This seems to be why it slows down when the number of threads increases.
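To put rough numbers on that, this is the back-of-the-envelope calculation I have in mind (a sketch only: 200000 is a placeholder, not my real number of stops):

#include <cstddef>
#include <cstdio>

int main() {
    const std::size_t n_stops   = 200000;     // placeholder, not my real stop count
    const std::size_t n_threads = 16;

    const std::size_t tau_per_thread = n_stops * sizeof(int);   // one tau per thread
    const std::size_t tau_total      = tau_per_thread * n_threads;

    std::printf("tau per thread : %zu KB\n", tau_per_thread / 1024);
    std::printf("tau all threads: %zu KB\n", tau_total / 1024);
    // The question is whether the combined working set still fits in the
    // shared last-level cache once every thread holds its own tau.
    return 0;
}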
So here is my question: what is happening in memory? Are some of the threads forced to pause while cache lines are swapped back and forth with other threads, and why is that?
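For reference, what I mean by false sharing is the situation sketched below, where small per-thread values packed next to each other land on the same cache line; padding each slot to the cache-line size is how I understand one would rule it out (a sketch assuming C++17; the PerThreadResult struct is hypothetical and not from my code):

#include <cstddef>
#include <new>      // std::hardware_destructive_interference_size (C++17)
#include <vector>

// Hypothetical per-thread slot. Without the alignas, several small slots would
// sit on the same cache line, and a write by one thread would keep invalidating
// that line in the other cores' caches (false sharing).
struct alignas(std::hardware_destructive_interference_size) PerThreadResult {
    double elapsed_ms = 0.0;
};

int main() {
    std::vector<PerThreadResult> results(16);   // one cache line per thread's slot
    (void)results;
    return 0;
}

In my code, though, each thread allocates its own vector<int> tau inside fun, so the tau elements of different threads live in separate heap allocations; that is part of why I am no longer sure false sharing is the right explanation.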
My architecture is:
- CPU - 11th Gen Intel(R) Core(TM) i7-11700KF @ 3.60GHz
- RAM - 16 GB DDR4
- OS - Windows 11
- Compiler - MS Visual Studio 2022
- DRAM hardware - see the attached CPU-Z report (screenshot)
And std::thread::hardware_concurrency() returns 16.
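For completeness, that number comes from this trivial check:

#include <iostream>
#include <thread>

int main() {
    // Number of concurrent threads the implementation supports (may return 0 if unknown)
    std::cout << std::thread::hardware_concurrency() << '\n';   // prints 16 on this machine
    return 0;
}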
EDIT 2: I added the function fun, how I time the code in the main function, and a report of the RAM memory.