Problem
I have a GraphFrames graph from which I've obtained the connected components. Now I would like to find the distance from a source node to a target node, where both belong to the same component.
| id_src | id_dst | component |
|---|---|---|
| 123 | 657 | 1 |
| 234 | 876 | 2 |
| 876 | 567 | 2 |
I would like to calculate the distance from `id_src` to `id_dst` for each row in this DataFrame, so the result would look like:
| id_src | id_dst | component | distance |
|---|---|---|---|
| 123 | 657 | 1 | 4 |
| 234 | 876 | 2 | 2 |
| 876 | 567 | 2 | 2 |
I know I need to use the `bfs` function from GraphFrames, but I can't find a way to run it in parallel while supplying the source and destination IDs for each row.
What I've tried
- I've tried to do it through a UDF, with no luck.
```python
from math import floor
import pyspark.sql.functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(IntegerType())
def shortest_path(x):
    if x[1] is not None:
        path = g.bfs(f"id = '{x[0]}'", f"id = '{x[1]}'", maxPathLength=4)
        # bfs returns columns (from, e0, v1, e1, ..., to); derive the hop count
        return floor((len(path.columns) - 2) / 2) + 1

result = vars.select(shortest_path(F.struct("id_src", "id_dst")))
```
This results in the exception below; I understand it's because I can't launch a distributed operation (the BFS) from inside another parallel operation (the UDF):
```
Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 476, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 72, in dumps
    cp.dump(obj)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 540, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle '_thread.RLock' object
```
- I also considered using a non-parallel library like networkx or igraph, building a separate graph from each connected component. The problem is I don't know how to generate these per-component graphs and then reference them from the UDF.
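To illustrate what I mean by the per-component idea, here is a plain-Python sketch (standard library only, no networkx) that groups edges by component and runs a BFS inside each one. The edge list and the intermediate nodes `111` and `999` are made up for the example; they are not from my real graph:

```python
from collections import defaultdict, deque

def bfs_distance(adj, src, dst):
    """Hop count from src to dst over an undirected adjacency map, or None."""
    seen = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return seen[node]
        for nbr in adj[node]:
            if nbr not in seen:
                seen[nbr] = seen[node] + 1
                queue.append(nbr)
    return None

# Hypothetical edges tagged with a component id (invented for illustration)
edges = [(234, 111, 2), (111, 876, 2), (876, 999, 2), (999, 567, 2)]

# Build one adjacency map per component
graphs = defaultdict(lambda: defaultdict(set))
for a, b, comp in edges:
    graphs[comp][a].add(b)
    graphs[comp][b].add(a)

# Distance for the pair (234, 876) in component 2
print(bfs_distance(graphs[2], 234, 876))  # 2 hops, via 111
```

What I can't figure out is how to do this grouping at scale and make each small graph available where the per-row lookup runs.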
Any ideas are appreciated. Thank you!