I'm trying to compute ego-nets for 250k special nodes residing inside a relatively big network (8M nodes and 17M edges). Since the cutting process takes 3 seconds per special node, I decided to use multiprocessing:
from graph_tool import Graph, GraphView
from graph_tool.topology import shortest_distance
from graph_tool import load_graph
import multiprocessing
import time

NO_PROC = 4
DEGREE = 4  # neighbours of n-th degree
NO_SPECIAL_NODES = 250000

graph = load_graph('./graph.graphml')  # 8M nodes, 17M edges

def ego_net(g, ego, n):
    print("Graph's identity: {}.".format(id(g)))  # check if you use the same object as in other calls
    d = shortest_distance(g=g, source=ego, max_dist=n)  # O(V+E)
    u = GraphView(g, vfilt=d.a < g.num_vertices())  # O(V)
    u = Graph(u, prune=True)
    return (ego, u)

if __name__ == "__main__":
    # generate arguments
    data = [(graph, node, DEGREE) for node in range(0, NO_SPECIAL_NODES)]

    # choose forking strategy explicitly
    ctx = multiprocessing.get_context('fork')
    pool = ctx.Pool(NO_PROC)
    results = pool.starmap(ego_net, data)
The problem with this approach is that, even though I explicitly choose the fork start method, the graph object is not shared with the subprocesses; instead, the big object is copied into each of them. This results in a MemoryError, since I can't provide enough RAM for multiple copies of the graph.
I know there are data structures defined in multiprocessing that support sharing between processes, but they don't seem to support complex objects such as Graph, of which graph is an instance. Is there any way to load the graph only once and let all the processes use it? I'm sure that the ego_net function only reads from the graph and doesn't modify it in any way.
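For reference, this is roughly the pattern I'm hoping is possible: pass only the node index to each worker and let the worker read a graph that was loaded once, before the fork. It is just a sketch of the idea (ego_net_by_id is a hypothetical variant of the function above, and I don't know whether fork's copy-on-write actually keeps the memory usage down for a graph-tool Graph):

from graph_tool import Graph, GraphView, load_graph
from graph_tool.topology import shortest_distance
import multiprocessing

NO_PROC = 4
DEGREE = 4
NO_SPECIAL_NODES = 250000

graph = load_graph('./graph.graphml')  # loaded once in the parent, before any fork

def ego_net_by_id(ego):
    # only the integer node id crosses the process boundary; the graph is
    # the module-level object inherited from the parent process
    d = shortest_distance(g=graph, source=ego, max_dist=DEGREE)
    u = GraphView(graph, vfilt=d.a < graph.num_vertices())
    return (ego, Graph(u, prune=True))

if __name__ == "__main__":
    ctx = multiprocessing.get_context('fork')
    with ctx.Pool(NO_PROC) as pool:
        results = pool.map(ego_net_by_id, range(NO_SPECIAL_NODES))

If something along these lines can be made to keep a single copy of the graph in memory, that would solve my problem.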