
I already described a similar problem, but only to understand its causes. If this counts as a duplicate as well, I will remove the question.

I work on a problem which can be thought of as a sort of shortest path computation in a really big graph. On this graph I solve an all-to-all shortest path problem, i.e., for each node n I find every shortest path having n as its source.

In order to compute the entire solution quickly I'm adopting multithreading. What I do is simply divide the set of nodes between the threads and compute the shortest paths in parallel. The problem comes when I save the solution.

size_t sz = all_nodes.size();
size_t np = config.n_threads;
size_t part = sz / np;
vector<vector<int>> solutions(np);

The vector containing ALL the solutions is a vector of vectors, and each thread works on a different entry of the vector `solutions`.

vector<std::thread> threads(np);

auto paraTask = [&](size_t start, size_t end, vector<int> &sol) {
    for (size_t l = start; l < end; ++l)
        fun({all_nodes[l]}, sol);
};

for (size_t i = 0; i < np; i++) {
    size_t start = i * part;
    // The last thread also takes the remainder of the division sz / np
    size_t length = (i + 1 == np) ? sz - i * part : part;
    threads[i] = std::thread(paraTask, start, start + length, std::ref(solutions[i]));
}

The function `fun` writes into `solutions`.

At the end I join the threads:

for (auto &&thread: threads) thread.join();

What happens is that the computation time improves going from 1 to 2 to 4 threads, but then it starts increasing again. I have 8 physical cores and 16 logical ones. My understanding is that this is due to the resizing of the vectors (the entries of `solutions`), which causes the other threads to pause during these resizing procedures.

I tried to profile the code and read the cache misses, and there are around 4 million of them (I am really not sure how to read this information). I therefore tried using `std::list<vector<int>> solutions`, hoping this would overcome the problem related to the contiguous memory of vectors, but here too the computation time starts to increase again after 4 threads.
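If reallocation were really the culprit, one way to take it out of the picture would be to reserve capacity in each per-thread vector up front, something like the sketch below (the capacity estimate is a made-up placeholder, not a measured value):

// Hypothetical experiment: reserve enough capacity up front so that
// push_back inside fun never reallocates. `expected_per_thread` is a
// guess; in the example above fun pushes two ints per reachable stop.
size_t expected_per_thread = 2 * part;
for (auto &sol : solutions)
    sol.reserve(expected_per_thread);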

How can I solve this problem?

PS: as a 'stupid' solution I tried hard-coding the individual threads: I impose 8 threads and create a separate vector for each of them, so that they are independent of one another and are no longer entries of a vector of vectors (see the sketch below). Here too the same thing happens: time increases with respect to 4 threads.
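Roughly, that attempt looked like this (a minimal sketch; the variable names are made up and `fun` is called as above):

// Sketch of the hard-coded attempt: eight independent, separately
// declared result vectors instead of entries of a vector of vectors.
vector<int> s0, s1, s2, s3, s4, s5, s6, s7;

auto work = [&](size_t first, size_t last, vector<int> &sol) {
    for (size_t l = first; l < last; ++l)
        fun({all_nodes[l]}, sol);
};

std::thread t0(work, 0 * part, 1 * part, std::ref(s0));
std::thread t1(work, 1 * part, 2 * part, std::ref(s1));
std::thread t2(work, 2 * part, 3 * part, std::ref(s2));
std::thread t3(work, 3 * part, 4 * part, std::ref(s3));
std::thread t4(work, 4 * part, 5 * part, std::ref(s4));
std::thread t5(work, 5 * part, 6 * part, std::ref(s5));
std::thread t6(work, 6 * part, 7 * part, std::ref(s6));
std::thread t7(work, 7 * part, sz, std::ref(s7)); // last one takes the remainder

t0.join(); t1.join(); t2.join(); t3.join();
t4.join(); t5.join(); t6.join(); t7.join();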

Any guess?

EDIT: Here is a really small working example of the function `fun`:

void fun(const Config &config, const vector<size_t> &points,
         vector<int> &solution) const {
    // Marked stops
    unordered_set<size_t> marked_stops;
    unordered_set<size_t> new_marked_stops;

    size_t n_stops = stops.size();

    int max_k = config.max_k;
    vector<vector<int>> tau;
    tau.reserve(max_k);

    for (int k = 0; k < max_k; ++k) {
        tau.emplace_back(n_stops, 1440 * 3);
    }

    // For each point, mark it as a starting point
    for (size_t p : points) {
        logger.debug("stop: %d", p);
        marked_stops.insert(p);
        tau[0][p] = config.start_time_horizon;
        pred_stop[0][p] = p;
        pred_trip[0][p] = 0;
    }

    for (int k = 1; k < max_k; ++k) {
        // First stage: update tau from previous round
        for (size_t i = 0, i_max = n_stops; i < i_max; ++i) {
            tau[k][i] = tau[k - 1][i];
            pred_stop[k][i] = pred_stop[k - 1][i];
            pred_trip[k][i] = pred_trip[k - 1][i];
        }

        for (size_t p : marked_stops) {
            // For each line passing through p, check whether I can take
            // that line at p based on the tau value. Check the line, and
            // for each new stop encountered:
            new_marked_stops.insert(new_stop);
        }

        // Reset marked stops
        std::swap(marked_stops, new_marked_stops);
        new_marked_stops.clear();
    }

    // Print results
    for (size_t i = 0, i_max = n_stops; i < i_max; ++i) {
        int tmp_k = -1;
        int tmp_t = 1440 * 3;
        for (int k = 1; k < max_k; ++k) {
            // Find origin of destination i (elided)
        }
        if (tmp_k != -1) {
            solution.push_back(config.start_time_horizon);
            solution.push_back(i);
        }
    }
}

Without going into details: we have a schedule, and we check the schedule to see whether we can take a line from a given stop and thereby unlock new stops.
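To give a rough idea, the elided loop over `marked_stops` conceptually does something like this (the `Line` type and the `lines_through`, `stops_after`, and `arrival_time` names are made-up placeholders, not my real API):

// Hypothetical sketch of the elided loop body.
for (size_t p : marked_stops) {
    for (const Line &line : lines_through(p)) {
        // Can I board this line at p at or after tau[k - 1][p]?
        for (size_t new_stop : line.stops_after(p)) {
            int arr = line.arrival_time(p, new_stop, tau[k - 1][p]);
            if (arr < tau[k][new_stop]) {
                tau[k][new_stop] = arr;            // improved arrival time
                new_marked_stops.insert(new_stop); // explore from here next round
            }
        }
    }
}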

EDIT: I removed the part where the threads add information to their vectors in `solution`, and the same thing still happens: time increases after 4 threads.

  • The problem might be in the allocation and reallocation needed for the vectors used inside the threads, not the wrapper vector, which you have sized appropriately. – Some programmer dude Mar 27 '23 at 12:44
  • But why would this cause an increase in the computation time, giving rise to a sort of scenario where more threads = more time? – Claudio Tomasi Mar 27 '23 at 12:48
  • Sounds like the threads are bottlenecked with contention for a resource. More threads, more contention, more time. – Eljay Mar 27 '23 at 12:51
  • It shouldn't really (unless your standard library implementation uses locks or something for its memory allocator), but then the wrapping vector itself, which is properly sized before any threads are running, shouldn't affect the thread performance either. So changing to a list won't help, you're chasing a red herring with that. – Some programmer dude Mar 27 '23 at 12:52
  • So it all comes down to `fun` and what it does. What does it do? Is the `fun` function using any kind of shared resources? Are there any locks or other synchronization? – Some programmer dude Mar 27 '23 at 12:53
  • We can say that `fun` finds each one-to-all shortest path. Of course it reads from the graph information, but it never writes to it. It only writes to its own variables and to its portion of the vector `solutions`. – Claudio Tomasi Mar 27 '23 at 12:55
  • Maybe your problem is not the computation time but the memory transfer. If your vectors are stored in your RAM, there is a good chance you lose more time reading/writing than in the actual data handling. – akira hinoshiro Mar 27 '23 at 12:57
  • "[I]t reads from the graph information"... *What* "graph information"? Unless you can create a proper [mre] to show us, including a description of the underlying problem your program is supposed to solve, it's really hard to say anything. All we can do is guess, and guess badly. There are thousands of possible reasons a multi-threaded application can get worse efficiency when you add more threads, it's not always up to the hardware. – Some programmer dude Mar 27 '23 at 13:01
  • @akirahinoshiro I tried with a smaller example (graph) and it starts to increase again after 15 threads, so maybe it is a memory management problem. How can this kind of thing be solved? And in general, why should more threads take more time? The number of reads and writes to memory is the same, just not in sequence. – Claudio Tomasi Mar 27 '23 at 13:03
  • @Someprogrammerdude By graph info I mean nodes, edges, and connectivity. I will try to give a small working example. – Claudio Tomasi Mar 27 '23 at 13:04
  • @ClaudioTomasi If you have dual-channel RAM, for example, and the whole execution time of a single-threaded implementation is 90% reading or writing data, there is not much benefit in using more threads. You have to optimize your code in chunks that fit things like your L1 cache per core. Cache locality isn't easy to handle. – akira hinoshiro Mar 27 '23 at 13:07
  • Does `fun` allocate any memory at all? Some memory allocators serialize memory allocations. – Botje Mar 27 '23 at 13:07
  • @Botje Yes; in order to solve one-to-all queries I keep info on which nodes were visited and at which time, so I can tell whether I can do better using another road in the graph. – Claudio Tomasi Mar 27 '23 at 13:16
  • Where is `solution.reserve` called? – Marek R Mar 27 '23 at 13:28
  • When I declare it I just pass np as the size of `solutions`. – Claudio Tomasi Mar 27 '23 at 13:29

1 Answer


If I were doing it, I wouldn't have the threads write directly to the output vector. Instead, each thread would run and generate its result, then push it to a thread-safe queue along with something to identify the thread that produced it (if that matters).

Then a single thread would pop items from the queue, and put them in the correct place in the result vector. Keeps the data handling neat and tidy, and generally scales quite well.
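A minimal sketch of such a queue, along the lines of the article linked in the comments below (a mutex plus a condition variable; note this is not a standard library type):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <utility>

// Minimal thread-safe queue: producers push results, one consumer pops.
template <typename T>
class ts_queue {
    std::queue<T> q_;
    std::mutex m_;
    std::condition_variable cv_;
public:
    void push(T value) {
        {
            std::lock_guard<std::mutex> lock(m_);
            q_.push(std::move(value));
        }
        cv_.notify_one();
    }
    T pop() { // blocks until an item is available
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty(); });
        T value = std::move(q_.front());
        q_.pop();
        return value;
    }
};

Each computation thread would then build a thread-local result vector and do something like `results.push({i, std::move(local_sol)});`, tagging the result with its index, while a single consumer thread pops np items and moves each one into `solutions[item.first]`.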

Jerry Coffin
  • Just to understand: we now have a shared queue, and each thread writes to it, correct? Or do I have a queue for each thread? In the first case, each write is thread-safe, but if a thread needs to write more than one piece of information, I need to lock that writing part, correct? – Claudio Tomasi Mar 28 '23 at 07:48
  • And furthermore, does there already exist a thread-safe queue structure? – Claudio Tomasi Mar 28 '23 at 12:55
  • Typically you'd have one queue shared between threads. But no, there isn't one in the standard library. Anthony Williams has quite a good article about how to write one though: https://www.justsoftwaresolutions.co.uk/threading/implementing-a-thread-safe-queue-using-condition-variables.html – Jerry Coffin Mar 28 '23 at 14:05
  • One last thing: why a thread-safe queue and not a thread-safe vector? Does it have something to do with the contiguous memory of vectors, and resizing on push_backs? – Claudio Tomasi Mar 29 '23 at 07:59
  • I added an edit; probably the problem is not entirely in the structure. – Claudio Tomasi Mar 29 '23 at 08:18
  • @ClaudioTomasi: As long as it has a queue-like interface where the computation thread just says something like `output.push(result);`, it doesn't matter much how it's really stored. But you generally want the computation thread to deal solely with computation, leaving it up to the caller to deal with storing, organizing, etc., the results. – Jerry Coffin Mar 29 '23 at 16:00