
I have a huge graph (for example, 300,000 nodes and 1,000,000 edges) which I'm analyzing using Python on an Ubuntu machine with 32GB of RAM and 4 CPU cores.

I found graph-tool to be very efficient for computing weighted betweenness centrality, much faster than NetworkX. However, loading such a huge graph into memory kills my application (out-of-memory).
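For reference, the computation I'm running in graph-tool looks roughly like this (the file name and the "weight" property name are placeholders for my actual data):

import graph_tool.all as gt

# Load the graph (~300,000 nodes, ~1,000,000 edges); this is the step
# that runs out of memory on my machine
g = gt.load_graph("graph.graphml")
weight = g.edge_properties["weight"]  # edge weight property map

# Weighted betweenness: returns vertex and edge betweenness property maps
vb, eb = gt.betweenness(g, weight=weight)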

For this reason, I was thinking of switching to Neo4j to store the graph and calculate betweenness centrality.

Can you help me with the following questions?

  1. Will Neo4j allow me to directly calculate weighted betweenness centrality (shortest paths computed taking edge weights into account), with the possibility of passing the result for every node back to Python?
  2. Will using Neo4j for the calculation save me from the out-of-memory kill, or will the problem persist?
  3. I could not find any performance comparison. Is the calculation of betweenness faster in graph-tool or in Neo4j? How large is the difference?
  4. Is there a better solution to my problem that I did not consider?
  • Neo4j does allow betweenness calculation. There is also an algorithm which is faster but provides approximate scores. Try it out: https://neo4j.com/docs/graph-algorithms/current/algorithms/betweenness-centrality/ You can also try py2neo for streaming the results (see the sketch after these comments). Try it out and share more details if you face a problem. – Himanshu Jain Oct 29 '18 at 18:53
  • Thanks, from their description it seems to use edge weights, though I was not 100% sure. Switching to Neo4j is not that easy, so I'm first trying to find out whether it is worthwhile in terms of memory use and performance. – Forinstance Oct 30 '18 at 07:24
  • Yes, I couldn't find a comparison online either. What kind of data do you have? There are ways to import data into Neo4j, so the transition should be okay. If you share sample data, we can figure out how to import and export; it is fairly easy. One of the options is dgraph, but I haven't personally used it and it's comparatively new. dgraph vs Neo4j: https://blog.dgraph.io/post/benchmark-neo4j/ Also, Neo4j is moving towards a more commercial license for the enterprise version instead of the AGPL, so if you are using it in production, I would consider some other alternative. – Himanshu Jain Oct 30 '18 at 18:22
  • Thanks a lot for pointing me to dgraph, which I did not know. Here is some sample data in Pajek .net format. What I have in reality is similar, but huge. https://ufile.io/1unji – Forinstance Oct 31 '18 at 06:33
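A rough sketch of the py2neo streaming approach suggested in the comments above (this assumes the legacy Neo4j Graph Algorithms library linked there; the node label, relationship type, and connection credentials are illustrative, and whether the procedure honors edge weights should be checked against that documentation):

from py2neo import Graph

# Connect to a running Neo4j instance (credentials are placeholders)
graph = Graph("bolt://localhost:7687", auth=("neo4j", "password"))

# Stream betweenness scores; 'Node' and 'CONNECTS' are illustrative names
query = """
CALL algo.betweenness.stream('Node', 'CONNECTS', {direction: 'both'})
YIELD nodeId, centrality
RETURN algo.asNode(nodeId).id AS node, centrality
"""

# Collect the per-node scores into a Python dict
scores = {record["node"]: record["centrality"] for record in graph.run(query)}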

1 Answer


Since you need to run a weighted betweenness centrality algorithm on a huge graph, you might be interested in GraphScope. This is its description:

GraphScope is a unified distributed graph computing platform that provides a one-stop environment for performing diverse graph operations on a cluster of computers through a user-friendly Python interface. GraphScope makes multi-staged processing of large-scale graph data on compute clusters simple by combining several important pieces of Alibaba technology: including GRAPE, GraphCompute, and Graph-Learn (GL) for analytics, interactive, and graph neural networks (GNN) computation, respectively.

Here is a quick start with GraphScope:

# Install GraphScope from PyPI (shell command)
pip3 install graphscope

>>> import graphscope
>>> graphscope.set_option(show_log=True)
>>>
>>> # Load a built-in sample graph (the p2p network dataset)
>>> from graphscope.dataset import load_p2p_network
>>> g = load_p2p_network()
>>>
>>> # Project to a simple graph and run the betweenness centrality algorithm
>>> pg = g.project(vertices={"host": []}, edges={"connect": []})
>>> c = graphscope.flash.betweenness_centrality(pg, source=1)

Refer to How to Run and Develop GraphScope Locally and GraphScope Doc for more information.

– lidongze