I am trying to run multiple instances of a shortest path algorithm for independent pairs of nodes in a graph. The algorithm internally uses the `bfs` method, which is "sandboxed" (self-contained; it does not write to any variables outside the method). The graph consists of 600,000 nodes and 2,000,000 edges.
I ran the parallelized and the sequential version for 14k nodes (from which the independent pairs of nodes were chosen); the sequential version takes 30 minutes while the OpenMP version takes 50 minutes.
I would expect OpenMP to be faster: there are no critical sections or complicated bottlenecks.
int bfs(int src, int dst, int thread) {
    const int _VERTEX = 600000;
    std::stringstream msg;
    msg << "Finding shortest path using bfs from : " << src << " to " << dst
        << " using thread " << thread << std::endl;
    std::cout << msg.str();
    std::vector<int> dist(_VERTEX, INT_MAX);
    std::vector<bool> visited(_VERTEX, false);
    std::queue<int> q;
    q.push(src);
    visited[src] = true;
    dist[src] = 0;
    while (!q.empty()) {
        int size = q.size();
        while (size--) {
            int curr = q.front();
            q.pop();
            for (std::vector<int>::iterator it = _ADJLST[curr].begin(); it != _ADJLST[curr].end(); ++it) {
                if (visited[*it]) { continue; }
                if (dist[*it] > dist[curr] + 1) {
                    dist[*it] = dist[curr] + 1;
                    q.push(*it);
                }
                // if (curr == dst) { return dist[dst]; }
                visited[*it] = true;
            }
        }
    }
    return dist[dst];
}
I know `bfs` can be optimized further by returning the distance as soon as `dst` is found.
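For reference, that early exit would sit inside the neighbor loop, roughly like this (a sketch reusing the same variables as in `bfs` above; BFS on an unweighted graph finalizes a node's distance the first time the node is discovered):

for (std::vector<int>::iterator it = _ADJLST[curr].begin(); it != _ADJLST[curr].end(); ++it) {
    if (visited[*it]) { continue; }
    if (dist[*it] > dist[curr] + 1) {
        dist[*it] = dist[curr] + 1;
        // First time dst is discovered: its distance is already minimal, so stop searching.
        if (*it == dst) { return dist[dst]; }
        q.push(*it);
    }
    visited[*it] = true;
}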
And here is my `main` function:
int main() {
    int i, tid, nthreads;
    // #pragma omp parallel for private(i,tid,nthreads) schedule(dynamic,1)
    #pragma omp parallel for private(i, tid, nthreads)
    for (i = 0; i < _PAIRLST.size(); i++) {
        int src = _PAIRLST[i].first;
        int dst = _PAIRLST[i].second;
        tid = omp_get_thread_num();
        bfs(src, dst, tid);
    }
}
`_PAIRLST` has a size of 14k and stores the node pairs to be processed. `_ADJLST` is an `unordered_map` of vectors storing the adjacency list.
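For reference, the declarations look roughly like this (a sketch; the exact types are my assumption based on the description above, and the code that fills them is elided):

#include <unordered_map>
#include <utility>
#include <vector>

// Adjacency list: node id -> ids of its neighbors.
std::unordered_map<int, std::vector<int>> _ADJLST;
// Independent (src, dst) pairs to process; about 14k entries.
std::vector<std::pair<int, int>> _PAIRLST;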
Compiled with gcc 8.2.0 with `-Ofast`.
I also experimented with static scheduling in OpenMP and with leaving the number of threads at the runtime default, and I got a similar result.
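For concreteness, the static-scheduling variant looked like this (a sketch; the loop body is unchanged from `main` above):

#pragma omp parallel for private(i, tid, nthreads) schedule(static)
for (i = 0; i < _PAIRLST.size(); i++) {
    int src = _PAIRLST[i].first;
    int dst = _PAIRLST[i].second;
    tid = omp_get_thread_num();
    bfs(src, dst, tid);
}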
Is there something I am missing here?
My hardware:
| Key | Value |
|:----------------- |:------------------------------------------ |
|Architecture | x86_64 |
|CPU op-mode(s) | 32-bit, 64-bit |
|Byte Order | Little Endian |
|CPU(s) | 48 |
|On-line CPU(s) list| 0-47 |
|Thread(s) per core | 2 |
|Core(s) per socket | 12 |
|Socket(s) | 2 |
|NUMA node(s) | 2 |
|Vendor ID | GenuineIntel |
|CPU family | 6 |
|Model | 85 |
|Model name | Intel(R) Xeon(R) Gold 5118 CPU @ 2.30GHz|
|Stepping | 4 |
|CPU MHz | 2300.000 |
|BogoMIPS | 4600.00 |
|Virtualization | VT-x |
|L1d cache | 32K |
|L1i cache | 32K |
|L2 cache | 1024K |
|L3 cache | 16896K |
|NUMA node0 CPU(s) | 0-11,24-35 |
|NUMA node1 CPU(s) | 12-23,36-47 |
Linux 3.10.0-1160.80.1.el7.x86_64 #1 SMP Sat Oct 8 18:13:21 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
As per Hari's request, I modified the `bfs` method to allocate less memory:
int bfs(int src, int dst, int thread) {
    std::stringstream msg;
    msg << "Finding shortest path using bfs from : " << src << " to " << dst
        << " using thread " << thread << std::endl;
    std::cout << msg.str() << std::flush;
    // std::vector<int> dist(_VERTEX, INT_MAX);
    // std::vector<bool> visited(_VERTEX, false);
    std::unordered_set<int> visited;
    std::unordered_map<int, int> dist;
    std::queue<int> q;
    q.push(src);
    visited.insert(src);
    dist[src] = 0;
    while (!q.empty()) {
        int size = q.size();
        while (size--) {
            int curr = q.front();
            q.pop();
            for (std::vector<int>::const_iterator it = _ADJLST[curr].begin(); it != _ADJLST[curr].end(); ++it) {
                if (visited.find(*it) != visited.end()) { continue; }
                if (dist.find(*it) == dist.end()) { dist[*it] = INT_MAX; }
                if (dist[*it] > dist[curr] + 1) {
                    dist[*it] = dist[curr] + 1;
                    q.push(*it);
                }
                // if (curr == dst) { return dist[dst]; }
                // visited[*it] = 1;
                visited.insert(*it);
            }
        }
    }
    return dist[dst];
}
Edit: Here are some of my logs:

Finding shortest path using bfs from : 9661 to 0 using thread 11
Finding shortest path using bfs from : 1787 to 0 using thread 2
Finding shortest path using bfs from : 9661 to 0 using thread 11
I am aware of the missing optimizations in the bfs algorithm itself.
Is it possible that the thread numbers given by `omp_get_thread_num()` are just dummies?
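For what it is worth, a minimal check like the following (a sketch, separate from my program) should show whether a parallel region really runs with distinct threads:

#include <cstdio>
#include <omp.h>

int main() {
    // Each thread reports its id and the team size; if the region is
    // genuinely parallel, distinct ids appear and the team size is > 1.
    #pragma omp parallel
    {
        #pragma omp critical
        std::printf("thread %d of %d (in parallel: %d)\n",
                    omp_get_thread_num(),
                    omp_get_num_threads(),
                    omp_in_parallel());
    }
    return 0;
}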