Problem
I have a GraphFrames graph from which I've obtained the connected components. Now I would like to find the distance from a source node to a target node, where both belong to the same component.
| id_src | id_dst | component |
|---|---|---|
| 123 | 657 | 1 |
| 234 | 876 | 2 |
| 876 | 567 | 2 |
I would like to calculate the distance from `id_src` to `id_dst` for each row in this DataFrame, so the result would look like:
| id_src | id_dst | component | distance |
|---|---|---|---|
| 123 | 657 | 1 | 4 |
| 234 | 876 | 2 | 2 |
| 876 | 567 | 2 | 2 |
I know I need to use the `bfs` function from GraphFrames, but I can't find a way to run it in parallel while supplying the source and destination IDs for each row.
What I've tried
- I've tried to do it through a UDF, with no luck.
```python
from math import floor
import pyspark.sql.functions as F
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

@udf(IntegerType())
def shortest_path(x):
    if x[1] is not None:
        path = g.bfs(f"id = '{x[0]}'", f"id = '{x[1]}'", maxPathLength=4)
        # bfs returns columns (from, e0, v1, e1, ..., to); derive the hop count
        return floor((len(path.columns) - 2) / 2) + 1

result = vars.select(shortest_path(F.struct("id_src", "id_dst")))
```
This results in the exception below; I understand it's because I can't launch a distributed operation (the BFS) from inside another parallel operation (the UDF):
```
Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 476, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 72, in dumps
    cp.dump(obj)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 540, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle '_thread.RLock' object
```
- I also considered using a non-parallel library like networkx or igraph, building a separate graph from each connected component. The problem is I don't know how to generate these per-component graphs and then reference them from the UDF.
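To illustrate what I mean by the per-component idea, here is a plain-Python sketch (standard library only, no networkx) that groups edges by component and runs a BFS inside each one. The edge list and the intermediate nodes `111` and `999` are made up for the example; they are not from my real graph:

```python
from collections import defaultdict, deque

def bfs_distance(adj, src, dst):
    """Hop count from src to dst over an undirected adjacency map, or None."""
    seen = {src: 0}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return seen[node]
        for nbr in adj[node]:
            if nbr not in seen:
                seen[nbr] = seen[node] + 1
                queue.append(nbr)
    return None

# Hypothetical edges tagged with a component id (invented for illustration)
edges = [(234, 111, 2), (111, 876, 2), (876, 999, 2), (999, 567, 2)]

# Build one adjacency map per component
graphs = defaultdict(lambda: defaultdict(set))
for a, b, comp in edges:
    graphs[comp][a].add(b)
    graphs[comp][b].add(a)

# Distance for the pair (234, 876) in component 2
print(bfs_distance(graphs[2], 234, 876))  # 2 hops, via 111
```

What I can't figure out is how to do this grouping at scale and make each small graph available where the per-row lookup runs.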
Any ideas are appreciated. Thank you!