
I'm trying to compute ego-nets for 250k special nodes residing inside a relatively big network (8M nodes and 17M edges). Since extracting the ego-net takes about 3 seconds per special node, I decided to use multiprocessing:

from graph_tool import Graph, GraphView
from graph_tool.topology import shortest_distance
from graph_tool import load_graph
import multiprocessing
import time

NO_PROC = 4
DEGREE = 4 # neighbours of n-th degree
NO_SPECIAL_NODES = 250000

graph = load_graph('./graph.graphml') #8M nodes, 17M edges

def ego_net(g, ego, n):
    print("Graph's identity: {}.".format(id(g))) # check if you use the same object as in other calls
    d = shortest_distance(g=g, source=ego, max_dist=n) # O(V+E); vertices beyond max_dist get the distance type's maximum value
    u = GraphView(g, vfilt=d.a < g.num_vertices()) # O(V); keep only vertices actually reached within n hops
    u = Graph(u, prune=True) # copy the filtered view into a standalone pruned graph
    return (ego, u)

if __name__ == "__main__":
    # generate arguments
    data = [(graph, node, DEGREE) for node in range(0, NO_SPECIAL_NODES)]

    # choose forking strategy explicitly
    ctx = multiprocessing.get_context('fork')
    pool = ctx.Pool(NO_PROC)

    results = pool.starmap(ego_net, data)

The problem with this approach is that, even though I explicitly choose the fork start method, the graph object is not shared with the subprocesses; instead, the big object is copied into each subprocess. This results in a MemoryError, since I can't provide enough RAM for that many copies of the graph.

I know multiprocessing defines data structures that support sharing between processes, but they don't seem to support complex objects such as Graph, which graph is an instance of. Is there any way to load the graph only once and let all the processes use it? I'm sure that ego_net only reads from the graph and doesn't modify the object in any way.
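
For reference, here is a minimal sketch of the workaround I'm currently considering: keep the graph as a module-level global and pass only the node index to the workers, so that with the fork start method the children inherit the parent's memory copy-on-write instead of receiving a pickled copy of the graph as an argument. I'm not sure whether graph-tool's underlying C++ structures really stay shared this way, which is part of why I'm asking:

from graph_tool import Graph, GraphView, load_graph
from graph_tool.topology import shortest_distance
import multiprocessing

NO_PROC = 4
DEGREE = 4
NO_SPECIAL_NODES = 250000

# loaded once in the parent; forked children should inherit these pages copy-on-write
graph = load_graph('./graph.graphml')

def ego_net(ego):
    # reference the module-level graph instead of taking it as an argument,
    # so it is never pickled and shipped through the pool
    d = shortest_distance(g=graph, source=ego, max_dist=DEGREE)
    u = GraphView(graph, vfilt=d.a < graph.num_vertices())
    return ego, Graph(u, prune=True)

if __name__ == "__main__":
    ctx = multiprocessing.get_context('fork')
    with ctx.Pool(NO_PROC) as pool:
        results = pool.map(ego_net, range(NO_SPECIAL_NODES))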

balkon16
  • I do not believe that what you want to do is strictly speaking possible. Part of the advantage of using a multiprocessing model is that you avoid GIL (global interpreter lock) contention. If you were to share complex Python objects between processes, they would have to share a GIL, which would defeat the purpose. I'll let someone more knowledgeable recommend a solution, however; I could be mistaken. – Ken Kinder Jan 30 '20 at 09:11

1 Answer


Since you want to compute shortest distances on a big graph in parallel, you might be interested in GraphScope. This is the description:

GraphScope is a unified distributed graph computing platform that provides a one-stop environment for performing diverse graph operations on a cluster of computers through a user-friendly Python interface. GraphScope makes multi-staged processing of large-scale graph data on compute clusters simple by combining several important pieces of Alibaba technology: including GRAPE, GraphCompute, and Graph-Learn (GL) for analytics, interactive, and graph neural networks (GNN) computation, respectively.

Here is a quick start with GraphScope:

# install graphscope by pip
pip3 install graphscope
>>> import graphscope
>>> graphscope.set_option(show_log=True)
>>>
>>> # load graph
>>> from graphscope.dataset import load_p2p_network
>>> g = load_p2p_network()
>>>
>>> # run sssp algorithm
>>> pg = g.project(vertices={"host": ["id"]}, edges={"connect": ["dist"]})
>>> c = graphscope.sssp(pg, src=6)
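
If I understand the docs correctly, the resulting context can then be pulled into a pandas DataFrame and filtered; something roughly like the following (the selector keys are illustrative, and note that this example's sssp uses the "dist" edge weights rather than plain hop counts):

>>> # sketch: materialize per-vertex distances and keep vertices close to the source
>>> df = c.to_dataframe(selector={"id": "v.id", "dist": "r"})
>>> ego_ids = df[df["dist"] <= 4]["id"].tolist()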

Refer to How to Run and Develop GraphScope Locally and the GraphScope documentation for more information.

lidongze