
I am trying to run multiple instances of a shortest-path algorithm for independent pairs of nodes in a graph.
The algorithm internally uses the bfs method, which is "sandboxed" (self-contained; it writes no variables outside the method). The graph consists of 600,000 nodes and 2,000,000 edges.
I ran the parallelized and sequential iterations for 14k nodes (from which the independent pairs of nodes were chosen); the sequential iterations take 30 minutes, while the OpenMP version takes 50 minutes.

I would expect OpenMP to be faster: there is no critical section or complicated bottleneck.

int bfs(int src, int dst, int thread) {
    const int _VERTEX = 600000;
    std::stringstream msg;
    msg << "Finding shortest path using bfs from : " << src << " to " << dst << " using thread " << thread << std::endl;

    std::cout << msg.str();

    std::vector<int> dist(_VERTEX, INT_MAX);    // distance from src to every node
    std::vector<bool> visited(_VERTEX, false);  // nodes already discovered

    std::queue<int> q;
    q.push(src);
    visited[src] = true;
    dist[src] = 0;

    while (!q.empty()) {
        int size = q.size();
        while (size--) {                        // process the current BFS level
            int curr = q.front();
            q.pop();
            for (std::vector<int>::iterator it = _ADJLST[curr].begin(); it != _ADJLST[curr].end(); ++it) {
                if (visited[*it]) { continue; }

                if (dist[*it] > dist[curr] + 1) {
                    dist[*it] = dist[curr] + 1;
                    q.push(*it);
                }
                // if (curr == dst) { return dist[dst]; }
                visited[*it] = true;
            }
        }
    }
    return dist[dst];
}

I know bfs can be optimized further by returning dist as soon as dst is found.
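For reference, a minimal sketch of that early exit (an illustration, not the code measured above): in an unweighted BFS the first time dst is assigned a distance it is already minimal, so the relaxation step in the neighbour loop can return immediately:

if (dist[*it] > dist[curr] + 1) {
    dist[*it] = dist[curr] + 1;
    if (*it == dst) { return dist[dst]; }   // first discovery of dst is the shortest distance
    q.push(*it);
}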

And here is my main function:


int main() {

    int i, tid, nthreads;

    // also tried: #pragma omp parallel for private(i,tid,nthreads) schedule(dynamic,1)
    #pragma omp parallel for private(i,tid,nthreads)
    for (i = 0; i < (int)_PAIRLST.size(); i++) {
        int src = _PAIRLST[i].first;
        int dst = _PAIRLST[i].second;

        tid = omp_get_thread_num();
        bfs(src, dst, tid);
    }
}

_PAIRLST has a size of 14k and stores the node pairs to be processed. _ADJLST is an unordered_map of vectors storing the adjacency list.
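For reference, the globals are declared roughly like this (a sketch; the exact container for _PAIRLST may differ, but it supports size() and operator[] in main above):

// Global graph data (sketch of the declarations, which are omitted above)
std::unordered_map<int, std::vector<int>> _ADJLST;   // node -> list of neighbours
std::vector<std::pair<int, int>> _PAIRLST;           // ~14k (src, dst) pairs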

Compiled with gcc 8.2.0 with -Ofast.

I also experimented with OpenMP's static scheduling and with leaving the number of threads at its default, and I got a similar result.
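For reference, one way to time the loop from inside the program (a sketch using omp_get_wtime, OpenMP's wall-clock timer):

double t0 = omp_get_wtime();    // wall-clock time before the parallel loop

#pragma omp parallel for schedule(dynamic,1)
for (int i = 0; i < (int)_PAIRLST.size(); i++) {
    bfs(_PAIRLST[i].first, _PAIRLST[i].second, omp_get_thread_num());
}

std::cout << "elapsed: " << omp_get_wtime() - t0 << " s" << std::endl;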

Is there something I am missing here?

My hardware:
| Key | Value |
|:----------------- |:------------------------------------------ |
|Architecture | x86_64 |
|CPU op-mode(s) | 32-bit, 64-bit |
|Byte Order | Little Endian |
|CPU(s) | 48 |
|On-line CPU(s) list| 0-47 |
|Thread(s) per core | 2 |
|Core(s) per socket | 12 |
|Socket(s) | 2 |
|NUMA node(s) | 2 |
|Vendor ID | GenuineIntel |
|CPU family | 6 |
|Model | 85 |
|Model name | Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz|
|Stepping | 4 |
|CPU MHz | 2300.000 |
|BogoMIPS | 4600.00 |
|Virtualization | VT-x |
|L1d cache | 32K |
|L1i cache | 32K |
|L2 cache | 1024K |
|L3 cache | 16896K |
|NUMA node0 CPU(s) | 0-11,24-35 |
|NUMA node1 CPU(s) | 12-23,36-47 |

Linux 3.10.0-1160.80.1.el7.x86_64 #1 SMP Sat Oct 8 18:13:21 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

As per Hari's request, I modified the bfs method to allocate less memory:

int bfs(int src, int dst, int thread) {
    std::stringstream msg;
    msg << "Finding shortest path using bfs from : " << src << " to " << dst << " using thread " << thread << std::endl;

    std::cout << msg.str() << std::flush;

    // std::vector<int>  dist(_VERTEX, INT_MAX);
    // std::vector<bool> visited(_VERTEX, false);

    std::unordered_set<int> visited;    // grows with the explored region only
    std::unordered_map<int, int> dist;  // distances allocated on demand

    std::queue<int> q;
    q.push(src);
    visited.insert(src);
    dist[src] = 0;

    while (!q.empty()) {
        int size = q.size();
        while (size--) {
            int curr = q.front();
            q.pop();
            for (std::vector<int>::const_iterator it = _ADJLST[curr].begin(); it != _ADJLST[curr].end(); ++it) {
                if (visited.find(*it) != visited.end()) { continue; }

                if (dist.find(*it) == dist.end()) { dist[*it] = INT_MAX; }

                if (dist[*it] > dist[curr] + 1) {
                    dist[*it] = dist[curr] + 1;
                    q.push(*it);
                }
                // if (curr == dst) { return dist[dst]; }
                // visited[*it] = 1;
                visited.insert(*it);
            }
        }
    }
    return dist[dst];
}

Edit:

Here are some of my logs:

Finding shortest path using bfs from : 9661 to 0 using thread 11
Finding shortest path using bfs from : 1787 to 0 using thread 2
Finding shortest path using bfs from : 9661 to 0 using thread 11

I am aware of the missing optimizations in the bfs algorithm itself.

Is it possible that the thread numbers given by omp_get_thread_num() are just dummies?
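A quick sanity check (a sketch, not in my original code): omp_get_num_threads() reports the actual team size, but only when called from inside a parallel region (outside one it returns 1):

#pragma omp parallel
{
    #pragma omp single           // print once, from a single thread of the team
    std::cout << "team size: " << omp_get_num_threads() << std::endl;
}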

kil47
  • Independently of OpenMP, not all algorithms scale. In fact, BFS is one algorithm that tends not to scale because of the interaction between threads (or synchronizations). Besides, you tell OpenMP to create 200 threads, but I hardly believe you have a machine with 200 cores. Creating threads is expensive: splitting one task into 200 and giving them all to one human does not speed up the completion of the task, and even if the tasks are independent, giving them to 200 humans might not be better. The first thing to do is to reduce the number of threads and to create the parallel section only once (see the sketch after these comments). – Jérôme Richard May 08 '23 at 12:15
  • 200 threads was just an experiment; even without that, the result is the same. I will try simply using the parallel section once. – kil47 May 08 '23 at 12:19
  • `int main:`? That's not C++. Please post a [mcve]. – 463035818_is_not_an_ai May 08 '23 at 12:20
  • I do not understand how your parallel code is supposed to work without a "critical section or complicated bottlenecks". A BFS must not traverse nodes twice, and this requires a synchronization/communication strategy. Setting `visited` in parallel and letting threads access the variable in parallel results in a *race condition* and thus *undefined behaviour* (i.e., possibly wrong results). In that case, performance is the least of your problems (and race conditions do affect performance due to the way CPU caches work). – Jérôme Richard May 08 '23 at 12:20
  • I was under the assumption that each thread has its own copy of the variables local to a function. Here dist/visited are local variables. Is that not the case? – kil47 May 08 '23 at 12:27
  • It looks like the OP is trying to find the shortest distance using BFS between pairs of nodes given in a list (_PAIRLST). Since each pair is independent, the BFS used to find the shortest path for any pair of nodes runs independently of the other pairs. @kil47, you could try a smaller value for _VERTEX (in other words, a smaller graph) and see the effect of the parallelization. There are `dist` and `visited` vectors within each BFS routine; if _VERTEX is large, there are many huge vectors occupying a lot of memory. – Hari May 08 '23 at 12:33
  • @Hari, sure, I'll try this, but in the meanwhile: is my assumption correct that there is no race condition here of the kind mentioned by Jérôme Richard? – kil47 May 08 '23 at 13:03
  • @kil47, in your code there are no variables shared by the threads, thus there will be no race conditions. What Jérôme Richard is stating is very relevant to the situation where the BFS itself is parallelized; in that case there would be coordination between the threads. In your case, BFS runs inside a shortest path algorithm and the shortest paths are found independently; there is no coordination between the threads. – Hari May 08 '23 at 13:10
  • I attached the modified code with the lower-memory initialization; performance still does not improve. – kil47 May 08 '23 at 13:29
  • @kil47, even if you do not allocate memory upfront for dist and visited, if you work with a large graph those data structures are going to become very big as the BFS runs. Did you try with a smaller graph? Also, it is a good idea to check whether the shortest path algorithm is working properly. – Hari May 08 '23 at 13:38
  • Yes, on the smaller dataset it does work properly; the distance array/map is filled with correct values. – kil47 May 08 '23 at 13:52
  • Side note: be careful with `-Ofast`; it breaks some language rules and can generate incorrect results in some cases. In most cases you should prefer `-O2` or `-O3`. – Jesper Juhl May 08 '23 at 16:18
  • Did you time the execution with various numbers of threads (1, 2, 4, 8, ...)? Quite often there is first a speed-up, then a slow-down when the number of threads gets too high. – PierU May 08 '23 at 21:06
  • @PierU: I tried with 4 and 8; I do not observe any difference in the result. – kil47 May 09 '23 at 04:28
  • Could you provide a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example)? Talking without testing the code on our own is just pointless at this stage... – PierU May 09 '23 at 07:00
  • @PierU: a minimal reproducible example is provided by Hari in his post below. I am afraid the problem is not visible on such a small dataset, but yes, the MRE is in Hari's post below. – kil47 May 09 '23 at 08:03
  • As you say, it's too small a dataset... – PierU May 09 '23 at 09:24
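As an aside, here is a minimal sketch of Jérôme Richard's suggestion to create the parallel section only once. Note that the combined `#pragma omp parallel for` used in the question already creates a single region; this form merely makes the single thread team explicit:

#pragma omp parallel                    // one thread team, created once
{
    #pragma omp for schedule(dynamic)   // loop iterations are shared among the team
    for (int i = 0; i < (int)_PAIRLST.size(); i++) {
        bfs(_PAIRLST[i].first, _PAIRLST[i].second, omp_get_thread_num());
    }
}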

1 Answer


The goal is to find the shortest distance using BFS between pairs of nodes given in a list. Since each pair is independent, the BFS used to find the shortest path for any pair of nodes runs independently of the other pairs. No variables are written by more than one thread, i.e., there is no coordination between the threads. Thus, there will be no race conditions.

The code below runs on a toy graph (4 nodes, 5 edges, and six pairs of source and target nodes). With such a light load, the total amount of work done is very small. In order to test the effect of parallelization, an artificial delay is added by calling sleep for 5 seconds in the for loop. When the number of threads is set to 1, the total time taken is around 30 seconds; when the number of threads is set to 6, the total time is around 5 seconds. The code was run on Ubuntu, and `time` was used to get the timing information. Thus, we can see that the speed-up happens as expected. Also, the shortest distances computed are correct.

Possible reasons for the slowdown:

  • Could it be the data structures dist and visited, which are present in each thread? Less likely: each is on the order of the number of nodes, and for around a million nodes they together occupy only a few MB (see the discussion in the comments).
  • Could it be the threads accessing the adjacency list data structure in an uncoordinated manner? If one thread could re-use the information loaded into the cache by another thread, it would be better. This could be investigated; see the sketch after this list.
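One cheap experiment along those lines (a hypothetical sketch, not something tested here): sort the pairs by source node before the parallel loop, so that consecutive iterations tend to start their BFS in nearby parts of the adjacency list.

// Hypothetical cache-locality experiment (needs <algorithm>): group pairs with
// equal/nearby sources so adjacent iterations touch overlapping graph regions.
std::sort(_PAIRLST.begin(), _PAIRLST.end());   // sorts by .first, then .second

#pragma omp parallel for schedule(static)
for (int i = 0; i < (int)_PAIRLST.size(); i++) {
    bfs(_PAIRLST[i].first, _PAIRLST[i].second, omp_get_thread_num());
}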

Note: As per the answer here, the steps used below to set the number of threads in OpenMP ensure that that many threads are created.

#include <climits>

#include <iostream>
#include <queue>
#include <sstream>
#include <vector>

#include <unistd.h>

#include <omp.h>

const int kVertex = 4;

const std::vector<std::vector<int>> kAdjList{{1, 2, 3}, {0, 2}, {0, 1, 3}, {0, 2}};

const std::vector<std::pair<int, int>> kPairList{{0, 1}, {0, 2}, {0, 3},
                                                 {1, 2}, {1, 3},
                                                 {2, 3}};

int bfs(int src, int dst, int thread) {
   std::vector<int> dist(kVertex,INT_MAX);
   std::vector<bool> visited(kVertex,false);

   std::queue<int> q;
   q.push(src);
   visited[src] = true;
   dist[src] = 0;

   while (!q.empty()) {
       int size = q.size();
       while (size--) {
           int curr = q.front();
           q.pop();
           for (const auto &it : kAdjList[curr]) {
               if (visited[it]) {continue;}

               if (dist[it] > dist[curr] +1) {
                   dist[it] = dist[curr] + 1;
                   q.push(it);
               }
               // if (curr == dst) {return dist[dst];}
               visited[it] = 1;
           }
       }
   }

   return dist[dst];
}

int main() {
    // Force OpenMP to use a given number of threads.
    omp_set_dynamic(0);
    omp_set_num_threads(6);

    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(kPairList.size()); ++i) {
        sleep(5);

        std::stringstream op_str;
        op_str<<"distance between "<<kPairList[i].first
               <<" "<<kPairList[i].second
               <<" "<<bfs(kPairList[i].first, kPairList[i].second, omp_get_thread_num())
               <<std::endl;
        std::cout<<op_str.str();
    }

    return 0;
}
Hari
  • It definitely seems that way: the two huge vectors cause the threads to be slow. I would need to verify using a profiler, though. But in practice I would need to run it on those vectors; any recommendations to make it faster? – kil47 May 08 '23 at 20:07
  • @kil47 how large can they get (in bytes), and how much RAM do you have? Even if they are not that large, the algorithm may be memory-bound (and thus not scale well). – PierU May 08 '23 at 21:04
  • _VERTEX is 600000, so for the dist and visited vectors: ~2.5 + ~0.6 = ~3.1 MB per method call (apart from the one-time memory allocation of the adjacency list and pair list). For testing I have no limit on RAM, but currently I am using a machine with 8 GB RAM. Here is how I calculated the memory: (sizeof(int) * dist.capacity()) / 1000000.0; (sizeof(bool) * visited.capacity()) / 1000000.0 – kil47 May 09 '23 at 04:26
  • @kil47... So almost nothing actually, not "huge" at all. – PierU May 09 '23 at 06:34
  • Exactly. I think the thread counts shown are just dummies (omp_get_thread_num()). Is there any tool to profile threading on Linux which can actually show whether the threads are being created at the HW level? – kil47 May 09 '23 at 06:45
  • Indeed, for 600000 nodes the size of the two data structures is small compared to a RAM of a few GB. Next time I should explicitly calculate the memory requirement. Could you share a graph where you don't see the expected speed-up? – Hari May 09 '23 at 06:51
  • @kil47 why do you think they are "dummies"? – PierU May 09 '23 at 07:02