
I have a graph that consists of vertices and edges, and I am using the GraphFrames library to find its connected components.

from graphframes import GraphFrame

connected_components = GraphFrame(vertices, edges).connectedComponents(algorithm="graphx")

On small datasets this works perfectly well, but when I apply it to the real data I run into a problem. Since most of the elements belong to one connected component, the algorithm keeps iterating and shuffles more and more data onto a single executor. With 10M+ vertices (each carrying quite a few attributes), this eventually kills the executor. I have tried increasing the executor size, but I have reached my limit and cannot increase it any further.
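
For context, one idea I considered to reduce the per-vertex payload is to strip the vertex attributes before running the algorithm and join them back afterwards. This is only a rough, untested sketch (spark is the active SparkSession and the checkpoint path is just a placeholder), and it does not address the skew towards one huge component:

from graphframes import GraphFrame

# the default "graphframes" algorithm needs a checkpoint directory
spark.sparkContext.setCheckpointDir("/tmp/graphframes-checkpoints")  # placeholder path

slim_vertices = vertices.select("id")                     # drop the heavy attributes
g = GraphFrame(slim_vertices, edges.select("src", "dst"))
components = g.connectedComponents()                      # default algorithm
result = components.join(vertices, on="id", how="left")   # re-attach attributes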

I have been searching the internet and found this problem, but I am using Python and cannot develop in Scala, so I was thinking of making a huge driver and trying a brute-force RDD approach (generated with ChatGPT):

def connected_components(vertices, edges, clients_only: bool = True):
    """Calculates connected components."""

    # Convert the vertices DataFrame to an RDD of ids
    if clients_only:
        vertices = vertices.filter("is_client")
    vertices_rdd = vertices.rdd.map(lambda row: row.id)
    # Convert the edges DataFrame to an adjacency-list RDD.
    # Emit both directions so the traversal treats the graph as undirected.
    edges_rdd = edges.rdd.flatMap(lambda row: [(row.src, row.dst), (row.dst, row.src)])
    edges_dict_rdd = edges_rdd.groupByKey().mapValues(list)
    # Calculate connected components
    cp = connected_components_rdd(vertices_rdd, edges_dict_rdd)
    return cp.toDF(["id", "component"])


def connected_components_rdd(vertices, edges):
    """
    RDD approach proposed by ChatGPT. It is a brute-force approach,
    since the graphframes algorithm tends to put all data for a connected
    component into one place.
    """
    # Pull the whole graph onto the driver
    edges_dict = edges.collectAsMap()
    all_vertices = vertices.collect()

    # Iterative DFS with an explicit stack (recursion would exceed
    # Python's recursion limit on large components)
    def dfs(start, component_id):
        stack = [start]
        while stack:
            vertex = stack.pop()
            if vertex in visited:
                continue
            # Mark the vertex as visited and assign it the component ID
            visited.add(vertex)
            components[vertex] = component_id
            # Push all unvisited neighbours
            for neighbor in edges_dict.get(vertex, []):
                if neighbor not in visited:
                    stack.append(neighbor)

    # Iterate through unvisited vertices, labelling one component per seed
    visited = set()
    components = {}
    component_id = 0
    for vertex in all_vertices:
        if vertex not in visited:
            dfs(vertex, component_id)
            component_id += 1

    # Build an RDD of (vertex, component_id); note that the components dict
    # is shipped to the executors inside the lambda's closure
    rdd = vertices.map(lambda v: (v, components[v]))

    return rdd
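
For illustration, this is how I would call it (assuming, as in the code above, that vertices has id and is_client columns and edges has src and dst):

components_df = connected_components(vertices, edges, clients_only=False)
components_df.groupBy("component").count().orderBy("count", ascending=False).show(5)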

I do not really like this solution, since it requires collecting roughly all the data onto my driver. So I was wondering whether there is another way to get the connected components, but without the drawback of having to collect the data either on the driver or on a single executor?
