
P.S.: I've mentioned possible solutions to my problem, but I have several doubts about them, so please also give me suggestions on those. Also, if this question is not a good fit for this site, please point me to the correct site and I'll move the question there. Thanks in advance.

I need to run some repetitive graph theory and complex network algorithms to analyze approx. 2000 undirected simple graphs (with no self-loops) for some research work. Each graph has approx. 40,000 nodes and approx. 600,000 edges, which essentially makes them sparse graphs.

Currently, I am using NetworkX for my analysis. I am running nx.algorithms.cluster.average_clustering(G) and nx.average_shortest_path_length(G) on 500 such graphs; the code has been running for 3 days and has reached only the halfway point. This makes me fear that the full analysis will take an unexpectedly long time.
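For reference, my per-graph analysis is essentially the following (a rough sketch; the file names and the edge-list loader are placeholders for how I actually load my data):

import networkx as nx

def analyze(path):
    # Placeholder loader: my real graphs come from my own data files.
    G = nx.read_edgelist(path, nodetype=int)
    avg_clustering = nx.algorithms.cluster.average_clustering(G)
    avg_path_length = nx.average_shortest_path_length(G)  # requires a connected graph
    return avg_clustering, avg_path_length

# ~2000 files in total; 500 in the current run
results = [analyze(p) for p in ["graph_0001.edgelist", "graph_0002.edgelist"]]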

Before elaborating on my problem and the probable solutions I've thought of, let me mention my computer's configuration as it may help you in suggesting the best approach. I am running Windows 10 on an Intel i7-9700K processor with 32GB RAM and one Zotac GeForce GTX 1050 Ti OC Edition ZT-P10510B-10L 4GB PCI Express Graphics Card.

Explaining my possible solutions and my confusions regarding them:

A) Using the GPU with an Adjacency Matrix as the Graph Data Structure: I can put an adjacency matrix on the GPU and perform my analysis by coding the algorithms manually with PyCUDA or Numba, using loops only, since recursion cannot be handled by the GPU. The nearest thing I was able to find is this question on Stack Overflow, but it has no good solution.

My Expectations: I hope to speed up algorithms such as All-Pairs Shortest Paths, All Possible Paths between two nodes, Average Clustering, Average Shortest Path Length, Small-World Properties, etc. If this gives a significant speedup per graph, my results can be obtained very quickly.

My Confusions:

  1. Can these graph algorithms be coded efficiently on a GPU?
  2. Which will be better to use, PyCUDA or Numba?
  3. Is there any other way to store graphs on the GPU that could be more efficient, given that my graphs are sparse?
  4. I am an average Python programmer with no experience of GPU programming, so I will have to understand and learn GPU programming with PyCUDA/Numba. Which one is easier to learn?
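To illustrate what I mean in option A by coding the analysis manually on the GPU, here is a toy Numba CUDA sketch that computes per-node clustering coefficients from a dense adjacency matrix. It only shows the programming model I have in mind: the naive nested loop does O(|V|^2) work per node, so this exact kernel would not be competitive on my 40,000-node graphs, and a real implementation would likely need a sparse layout.

import numpy as np
from numba import cuda

@cuda.jit
def local_clustering_kernel(adj, coeff):
    # One thread per node: count triangles through node i on a dense
    # adjacency matrix and turn the count into a clustering coefficient.
    i = cuda.grid(1)
    n = adj.shape[0]
    if i >= n:
        return
    triangles = 0
    degree = 0
    for j in range(n):
        if adj[i, j]:
            degree += 1
            for k in range(j + 1, n):
                if adj[i, k] and adj[j, k]:
                    triangles += 1
    if degree < 2:
        coeff[i] = 0.0
    else:
        coeff[i] = 2.0 * triangles / (degree * (degree - 1))

# Toy driver on a small random undirected graph (not my real data).
n = 1024
rng = np.random.default_rng(0)
upper = np.triu(rng.integers(0, 2, size=(n, n), dtype=np.uint8), k=1)
adj = upper + upper.T  # symmetric, zero diagonal

d_adj = cuda.to_device(adj)
d_coeff = cuda.device_array(n, dtype=np.float64)
threads_per_block = 128
blocks = (n + threads_per_block - 1) // threads_per_block
local_clustering_kernel[blocks, threads_per_block](d_adj, d_coeff)
print("average clustering:", d_coeff.copy_to_host().mean())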

B) Parallelizing Programs on the CPU Itself: I can use Joblib or another library to run the program in parallel on my CPU. I can also arrange 2-3 more computers, on which I can run small independent portions of the program, or run 500 graphs per computer.

My Expectations: I hope to speed up the algorithms by parallelizing them and dividing the tasks among computers. If the GPU solution does not work, this method may still give me some hope.

My Confusions:

  1. Which other libraries are good alternatives to Joblib?
  2. Should I allot all CPU cores (8 cores on the i7) to my programs or use fewer cores?
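To make option B concrete, the kind of per-graph parallelism I have in mind with Joblib looks roughly like this (a rough sketch; the analyze function and the file names are placeholders):

import networkx as nx
from joblib import Parallel, delayed

def analyze(path):
    # Same per-graph work as above; the loader is a placeholder.
    G = nx.read_edgelist(path, nodetype=int)
    return nx.average_clustering(G), nx.average_shortest_path_length(G)

# Placeholder file names; ~2000 graphs in reality.
graph_files = [f"graph_{i:04d}.edgelist" for i in range(500)]

# One task per graph, a handful of worker processes at a time.
results = Parallel(n_jobs=6, verbose=5)(
    delayed(analyze)(p) for p in graph_files
)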

C) Apart from my probable solutions, do you have any other suggestions for me? If a better and faster solution is available in a language other than C/C++, please suggest it as well; I am already considering C++ as a fallback plan if nothing else works.


Work In Progress Updates

  1. Based on various suggestions from comments on this question and discussions in my community, these are the directions I've been advised to explore:

    • GraphBLAS
    • boost.graph + extensions with python-wrappers
    • graph-tool
    • Spark / Dask
    • PyCUDA / Numba
    • Linear Algebra methods using PyTorch
  2. I tried to run 100 graphs on my CPU using Joblib with n_jobs=-1; the CPU was continuously hitting 100°C, and the processor shut down after running for 3 hours. As a workaround, I am now using 75% of the available cores on multiple computers (so if 8 cores are available, I use 6), and the program runs fine; the speedup is also good. A sketch of the core-capping is below.
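The core-capping itself is only a couple of lines (a small sketch; analyze and graph_files are the same placeholders as in the Joblib sketch in option B above):

import os
from joblib import Parallel, delayed

# Use roughly 75% of the available cores to avoid thermal shutdowns.
n_jobs = max(1, int(os.cpu_count() * 0.75))

results = Parallel(n_jobs=n_jobs)(
    delayed(analyze)(p) for p in graph_files
)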

thepunitsingh
  • Please narrow your question down to one specific programming problem you encountered. – mkrieger1 Dec 08 '20 at 21:00
  • @mkrieger1, I'll try to reduce it, but it will be difficult for me as it is important that I mention maximum details about my problem and how I am thinking to solve them, to get the best suggestions. – thepunitsingh Dec 08 '20 at 21:02
  • `My problem is multifold and need a detailed suggestion` then you probably need to find a different site to post this on. StackOverflow forbids complex, multi-tiered questions in favor of more focused ones. – Random Davis Dec 08 '20 at 21:05
  • @RandomDavis Thanks for the suggestion. I'll be searching for more sites to move this question there. In case you are aware of any, please suggest. – thepunitsingh Dec 08 '20 at 21:07
  • networkx is pure python and obviously slow compared to boost.graph or CoinOR lemon for example. Building those algorithms on top of those libraries will probably gain a lot. In regards to GPU, you might look into recent / modern work, usually coined *GraphBLAS* where it's tried to approach these algorithms as algebraic as possible reusing concepts from algebraic libraries like BLAS/LAPACK (this abstraction leads to interesting semi-ring usage). But to be honest: C++ based CPU approaches based on the mentioned libs would be much faster to develop and should help much. – sascha Dec 08 '20 at 21:25
  • @sascha Thanks for your suggestions. I'll be surely looking into these without any delay. – thepunitsingh Dec 08 '20 at 21:26
  • And if you really do not want to go for C++, maybe [graph-tool](https://graph-tool.skewed.de/) is an alternative. It's basically boost.graph + extensions with python-wrappers (from what I have read). Might be non-trivial to install though on Windows. – sascha Dec 08 '20 at 21:27
  • I want to look at alternatives before going with C++ as I've lost practice of it. Thanks again for directing me in the correct direction. – thepunitsingh Dec 08 '20 at 21:30
  • There is a cuGraph or something library, but I haven't tried it. Also if you're taking the CPU route consider Spark which has graph support and/or Dask (maybe on pypy or with modin or some other similar library). Also to answer your "can these algos be coded in GPU" the answer is yes, for some, e.g. there is a nice formulation of PageRank using matrix decomposition and GPUs help a lot there. – Kostas Mouratidis Dec 08 '20 at 21:52
  • @KostasMouratidis I've looked into cuGraph, that's on Linux, but since I am on Windows I cannot use it as it is not available for Windows users for now. I'll look into your other suggestions. Thanks. – thepunitsingh Dec 08 '20 at 21:53

2 Answers


This is a broad but interesting question. Let me try to answer it.

2000 undirected simple graphs [...] Each graph has approx 40,000 nodes and approx 600,000 edges

Currently, I am using NetworkX for my analysis and currently running nx.algorithms.cluster.average_clustering(G) and nx.average_shortest_path_length(G)

NetworkX uses plain Python implementations and is not optimized for performance. It's great for prototyping, but if you encounter performance issues, it's best to rewrite your code using another library.

Other than NetworkX, the two most popular graph processing libraries are igraph and SNAP. Both are written in C and have Python APIs so you get both good single-threaded performance and ease of use. Their parallelism is very limited but this is not a problem in your use case as you have many graphs, rendering your problem embarrassingly parallel. Therefore, as you remarked in the updated question, you can run 6-8 jobs in parallel using e.g. Joblib or even xargs. If you need parallel processing, look into graph-tool, which also has a Python API.
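For illustration, a per-graph worker based on python-igraph, driven by Joblib, could look roughly like this (a sketch only; the edge-list loader and the file names are placeholders, so adapt the loading step to your data format):

import igraph as ig
from joblib import Parallel, delayed

def analyze(path):
    # Placeholder loader: adjust to however your graphs are stored.
    g = ig.Graph.Read_Edgelist(path, directed=False)
    avg_clustering = g.transitivity_avglocal_undirected()  # counterpart of NetworkX average_clustering
    avg_path_length = g.average_path_length(directed=False)
    return path, avg_clustering, avg_path_length

graph_files = [f"graph_{i:04d}.edgelist" for i in range(2000)]  # placeholder names
results = Parallel(n_jobs=6)(delayed(analyze)(p) for p in graph_files)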

Regarding your NetworkX algorithms, I'd expect the average_shortest_path_length to be reasonably well-optimized in all libraries. The average_clustering algorithm is tricky as it relies on node-wise triangle counting: a naive implementation takes O(|E|^2) time while an optimized implementation will do it in O(|E|^1.5). Your graphs are large enough that this difference amounts to running the algorithm on a graph in a few seconds vs. running it for hours.
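To put rough numbers on this (a back-of-the-envelope estimate, not a measurement): with |E| ≈ 600,000, |E|^2 is about 3.6 × 10^11 elementary operations per graph, while |E|^1.5 is about 4.6 × 10^8, i.e. roughly a 775× (≈ √|E|) difference in work.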

The "all-pairs shortest paths" (APSP) problem is very time-consuming, with most libraries using the Floyd–Warshall algorithm that has a runtime of O(|V|^3). I'm unsure what output you're looking for with the "All Possible Paths between two nodes" algorithm – enumerating all paths leads to an exponential amount of results and is unfeasible at this scale.

I would not start using the GPU for this task: an Intel i7-9700K should be up for this job. GPU-based graph processing libraries are challenging to set up and currently do not provide that significant of a speedup – the gains by using a GPU instead of a CPU are nowhere near as significant for graph processing as for machine learning algorithms. The only problem where you might be able to get a big speedup is APSP but it depends on which algorithms your chosen library uses.

If you are interested in GPU-based libraries, there are promising directions on the topic such as Gunrock, GraphBLAST, and a work-in-progress SuiteSparse:GraphBLAS extension that supports CUDA. However, my estimate is that you should be able to run most of your algorithms (barring APSP) in a few hours using a single computer and its CPU.

Gabor Szarnyas
  • Thanks for your detailed answer. With so much trial and error, I have also reached a similar conclusion to use `python igraph` with `joblib`, and I am getting good efficiency, so I am marking your answer as accepted. For now, I am not calculating "All Possible Paths between two nodes" as it is not feasible, but I am calculating all shortest paths. I'll look into the other options that you have provided in the answer. Thanks a lot. – thepunitsingh Feb 12 '21 at 21:47
  • @GaborSzarnyas Damn. I am using NetworkX to compute Graph Edit Distance between multiple graphs at a time. It takes forever on my CPU and I was wanting to find a way to utilise the GPU. I guess there isn't a way according to what you were saying? Would you happen to know of any other ways I could calculate the GED for many pairs? – BanAckerman Aug 10 '22 at 15:17
  • You will not be able to use the GPU without a dedicated library. I'd recommend trying another Python library such as https://github.com/jajupmochi/graphkit-learn, hoping that it has a more efficient implementation of Graph Edit Distance than NetworkX. Or, you can try using a C++-based implementation. This project seems reasonably well-documented albeit abandoned: https://github.com/dbblumenthal/gedlib. – Gabor Szarnyas Aug 11 '22 at 09:50

I think GraphScope can fully satisfy your requirements by offering user-friendly Python interfaces that are compatible with NetworkX and an efficient distributed runtime implemented in C++. That is, users only need to modify their NetworkX applications with a few lines of code to achieve several orders of magnitude of performance improvement. To make NetworkX applications run on GraphScope, users only need to replace import networkx with import graphscope.nx as networkx, since the graph manipulation and data-loading interfaces of GraphScope are fully compatible with NetworkX.
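For example, a single-machine NetworkX script only needs its import changed (a small sketch, assuming a local GraphScope deployment is available and that the algorithm is exposed through the NetworkX-compatible interface):

import graphscope.nx as nx  # instead of: import networkx as nx

# The rest of the NetworkX-style code stays the same.
G = nx.Graph()
G.add_edges_from([(0, 1), (1, 2), (2, 0), (2, 3)])
print(nx.average_clustering(G))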

The runtime of GraphScope is implemented in C++ for high efficiency. Taking the clustering algorithm as an example, running it on GraphScope is over 29x faster than running it on NetworkX in our testbed. Furthermore, GraphScope enables running NetworkX applications in a distributed manner, allowing for high scalability. To run NetworkX applications in a distributed fashion on a K8s cluster, users only need to replace import graphscope.nx as networkx with

import graphscope
networkx = graphscope.session(num_workers=$NUM_WORKERS).nx()

For more information about how to run NetworkX applications on GraphScope, please check out Analyzing graph with GraphScope in the Style of NetworkX.

Disclaimer: I'm an author of GraphScope.

Lei Wang