
Problem

I have a GraphFrames graph from which I've obtained the connected components. Now I would like to find the distance from a source node to a target node, both belonging to the same component.

id_src  id_dst  component
123     657     1
234     876     2
876     567     2

I would like to calculate the distance from id_src to id_dst for each row in this DataFrame, so the result would look like:

id_src  id_dst  component  distance
123     657     1          4
234     876     2          2
876     567     2          2

I know I need to use the BFS function from GraphFrames, but I can't find a way to run it in parallel while providing the source and destination id for each row.
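For context, a single lookup does work when I run it serially on the driver (here g is my GraphFrame, the vertex ids live in an id column, and 123 / 657 are just the ids from the first row above); what I can't figure out is how to run this once per row of the DataFrame:

# One serial BFS call on the driver for a single (source, target) pair
paths = g.bfs(fromExpr="id = '123'", toExpr="id = '657'", maxPathLength=4)

# The result has columns like: from, e0, v1, e1, ..., to,
# so counting the edge columns gives the number of hops
distance = sum(1 for c in paths.columns if c.startswith("e"))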

What I've tried

  1. I've tried to do it through a UDF, with no luck.
from math import floor

from pyspark.sql import functions as F
from pyspark.sql.functions import udf

@udf
def shortest_path(x):
    # x is a struct of (id_src, id_dst); g is the GraphFrame defined on the driver
    if x[1] is not None:
        path = g.bfs(f"id = '{x[0]}'", f"id = '{x[1]}'", maxPathLength=4)
        # path has columns from, e0, v1, ..., to; derive the hop count from them
        return floor((len(path.columns) - 2) / 2) + 1

result = vars.select(shortest_path(F.struct("id_src", "id_dst")))

This results in the following exception, which I understand happens because I can't parallelize an already parallel function:

Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 476, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 72, in dumps
    cp.dump(obj)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 540, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle '_thread.RLock' object
  2. I also thought about using a non-parallel library like networkx or igraph, creating a separate graph from each connected component. The problem is that I don't know how to generate these per-component graphs and then reference them from the UDF (a rough sketch of what I have in mind is below).
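This is roughly what I had in mind for idea 2, though I'm not sure it's the right approach (names like edges_by_comp, pairs and bfs_per_component are just placeholders; it assumes Spark 3.x for cogrouped applyInPandas, networkx installed on the workers, and that vars is the pairs DataFrame shown above):

import networkx as nx
import pandas as pd
from pyspark.sql import functions as F

# Component label per vertex, from the earlier connectedComponents() run
cc = g.connectedComponents().select("id", "component")

# Tag each edge with the component of its source vertex so that
# edges and (id_src, id_dst) pairs can be grouped on the same key
edges_by_comp = (g.edges
                 .join(cc.withColumnRenamed("id", "src"), on="src")
                 .select("src", "dst", "component"))

# The (id_src, id_dst, component) pairs shown above
pairs = vars

def bfs_per_component(pairs_pdf: pd.DataFrame, edges_pdf: pd.DataFrame) -> pd.DataFrame:
    # Build one small undirected networkx graph per component on the worker
    nxg = nx.from_pandas_edgelist(edges_pdf, source="src", target="dst")
    pairs_pdf["distance"] = [
        nx.shortest_path_length(nxg, s, t)
        for s, t in zip(pairs_pdf["id_src"], pairs_pdf["id_dst"])
    ]
    return pairs_pdf

# Cogroup pairs and edges by component and answer each group with networkx
# (adjust the schema types to match the real id columns)
result = (pairs.groupBy("component")
          .cogroup(edges_by_comp.groupBy("component"))
          .applyInPandas(bfs_per_component,
                         schema="id_src long, id_dst long, component long, distance long"))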

Any ideas are appreciated. Thank you!
