I have a graph that consists of vertices and edges, and I am using the graphframes library to find the connected components of that graph.
from graphframes import GraphFrame

connected_components = GraphFrame(vertices, edges).connectedComponents(algorithm="graphx")
On small datasets this works perfectly well, but when I apply it to the real data I run into a problem. Since most of the elements belong to a single connected component, the algorithm keeps iterating and piles more and more data onto a single executor. With 10M+ vertices (each with quite a few attributes attached), this leads to executor failure. I have tried increasing the size of the executors, but I have reached my limit and cannot increase it any further.
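For completeness, this is the variant I was planning to try next: the default "graphframes" algorithm instead of the GraphX one, with a checkpoint directory set (I believe the default algorithm requires one). The checkpointInterval and broadcastThreshold values below are just the documented defaults, not something I have tuned, and spark here is my SparkSession:

# Untested idea: the default large-star/small-star implementation instead of GraphX
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")  # example path

connected_components = GraphFrame(vertices, edges).connectedComponents(
    algorithm="graphframes",     # default implementation
    checkpointInterval=2,        # documented default
    broadcastThreshold=1000000,  # documented default
)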
I have been searching the internet and found this problem, but I am using Python and cannot develop in Scala, so I was thinking of provisioning a huge driver and trying a brute-force RDD approach (suggested by ChatGPT):
def connected_components(vertices, edges, clients_only: bool = True):
    """Calculates connected components."""
    if clients_only:
        vertices = vertices.filter("is_client")
    # Convert the vertices DataFrame to an RDD of vertex ids
    vertices_rdd = vertices.rdd.map(lambda row: row.id)
    # Convert the edges DataFrame to adjacency lists; emit both directions
    # so that the graph is treated as undirected
    edges_rdd = edges.rdd.flatMap(lambda row: [(row.src, row.dst), (row.dst, row.src)])
    edges_dict_rdd = edges_rdd.groupByKey().mapValues(list)
    # Calculate connected components
    cp = connected_components_rdd(vertices_rdd, edges_dict_rdd)
    return cp.toDF(["id", "component"])
def connected_components_rdd(vertices, edges):
    """
    RDD approach proposed by ChatGPT. It is a brute-force approach,
    since the graphframes algorithm tends to put all the data for a connected
    component in one place.
    """
    # Create an initial RDD with each vertex having its own component ID
    rdd = vertices.map(lambda v: (v, v))

    # Collect the adjacency lists onto the driver
    edges_dict = edges.collectAsMap()

    # DFS that labels vertices with a component ID; implemented with an explicit
    # stack because a recursive version hits Python's recursion limit on large components
    def dfs(start, component_id):
        stack = [start]
        while stack:
            vertex = stack.pop()
            if vertex in visited:
                continue
            # Mark the vertex as visited and assign it the component ID
            visited.add(vertex)
            components[vertex] = component_id
            # Queue all unvisited neighbours
            for neighbor in edges_dict.get(vertex, []):
                if neighbor not in visited:
                    stack.append(neighbor)

    # Iterate through unvisited vertices and perform a DFS from each one
    visited = set()
    components = {}
    component_id = 0
    for vertex in vertices.collect():
        if vertex not in visited:
            dfs(vertex, component_id)
            component_id += 1

    # Create an RDD with the final component ID for each vertex
    # (the whole components dict is shipped to the executors inside the closure)
    rdd = rdd.map(lambda values: (values[0], components[values[0]]))
    return rdd
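For reference, this is how I would invoke it (the column names match what the functions above expect: vertices with id and is_client, edges with src and dst):

# Hypothetical invocation, matching the column assumptions above
components_df = connected_components(vertices, edges, clients_only=True)
components_df.show(5)  # two columns: id, component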
I do not really like this solution, since it requires collecting roughly all the data onto my driver. So I was wondering whether there is another way to get the connected components, without the drawback of having to collect the data either on the driver or on a single executor?
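The only distributed alternative I have sketched so far (untested at scale; the function name and the max_iterations cut-off are my own) is an iterative minimum-label propagation done entirely with DataFrame joins, so that nothing is ever collected:

from pyspark.sql import functions as F

def connected_components_df(vertices, edges, max_iterations=50):
    """Label every vertex with the smallest vertex id reachable from it."""
    # Treat the graph as undirected by adding the reversed edges
    sym_edges = edges.select("src", "dst").union(
        edges.select(F.col("dst").alias("src"), F.col("src").alias("dst"))
    )
    # Start with every vertex in its own component
    labels = vertices.select("id", F.col("id").alias("component"))
    for _ in range(max_iterations):
        # Propagate each vertex's current label to its neighbours
        candidates = (
            sym_edges.join(labels, sym_edges["src"] == labels["id"])
                     .select(F.col("dst").alias("id"), F.col("component"))
        )
        # Keep the minimum label seen by each vertex
        new_labels = (
            labels.union(candidates)
                  .groupBy("id")
                  .agg(F.min("component").alias("component"))
        )
        # Count how many labels still changed in this round
        changed = (
            new_labels.alias("new").join(labels.alias("old"), "id")
                      .filter(F.col("new.component") < F.col("old.component"))
                      .count()
        )
        # Truncate the lineage so the plan does not grow with every iteration
        # (requires spark.sparkContext.setCheckpointDir(...) to be set)
        labels = new_labels.checkpoint()
        if changed == 0:
            break
    return labels

I am not sure this converges quickly enough on a graph with a large diameter, and the joins can still be skewed towards the big component, so I would be glad to hear about a better approach.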