
I am trying to compute the shortest path in a large network from a given source to a given target, based on edge weights, using Apache Spark. Since all my other code is written in Python, I don't want to switch languages. It should be possible somehow, shouldn't it? I am quite new to Spark, so maybe I just don't see how to solve the problem.

Maybe someone can help me out? Thanks in advance!

What I tried so far:

  • creating a vertex and edge list
  • using GraphFrame() to create a graph
  • using the GraphFrames shortestPaths method to compute the shortest path

So far so good (not really). The problem with the GraphFrames shortest path method is that it computes the shortest path from every node to the given set of nodes, which works for small graphs but takes ages for huge networks. A lot of "unnecessary" computation is done because all nodes are considered. I just need to get the shortest path from one node to another one.

I searched the internet and found that Spark's GraphX library has the kind of function I am looking for, but sadly it's only available for Scala...

Maybe I can just use the RDDs to compute the shortest path based on weights? Or is there a shortest path implementation for PySpark that I wasn't able to find? I can't believe that no shortest path algorithm is implemented for PySpark.

    vertices_rdd = vertices_rdd3.zipWithIndex()
    # vertices_rdd.take(3): 
    # [((552897.813699282, 4164322.19502139), 0), ((583743.487097408, 4158379.86761575), 1), ((585964.589845657, 4158443.96863072), 2)]

    edges_rdd = edges_rdd1.flatMap(lambda x: x)
    # edges_rdd.take(3): 
    # [(62734, 107857, 102.19468251940246, '8'), (107857, 62734, 102.19468251940246, '8'), (79903, 191109, 21.81675476329727, '13')]

    from pyspark.sql import SparkSession
    spark = SparkSession(sc)  # sc is the existing SparkContext

    vertices_df = vertices_rdd.toDF(["coordinate","id"])
    edges_df = edges_rdd.toDF(["src", "dst", "distance", "streetclass"])

    vertices_df.show()
    #+--------------------+---+
    #|          coordinate| id|
    #+--------------------+---+
    #|[552897.813699282...|  0|
    #|[583743.487097408...|  1|
    #|[585964.589845657...|  2|
    #|[588646.795215483...|  3|
    #|[582405.137425844...|  4|
    #|[582823.612980657...|  5|
    #...

    edges_df.show()
    #+------+------+------------------+-----------+
    #|   src|   dst|          distance|streetclass|
    #+------+------+------------------+-----------+
    #| 62734|107857|102.19468251940246|          8|
    #|107857| 62734|102.19468251940246|          8|
    #| 79903|191109| 21.81675476329727|         13|
    #|191109| 79903| 21.81675476329727|         13|
    #| 60790| 66205|19.362434806339824|         13|
    #... 

    from graphframes import GraphFrame
    g = GraphFrame(vertices_df, edges_df)

    results = g.shortestPaths(landmarks=["0"])
    results.select("id", "distances").show()
    #+---+-----------+
    #| id|  distances|
    #+---+-----------+
    #|  0|Map(0 -> 0)|
    #|  7|Map(0 -> 1)|
    #|  6|      Map()|
    #|  9|      Map()|
    #|  5|      Map()|
    #|  1|      Map()|
    #|  3|      Map()|
    #|  8|      Map()|
    #...
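
For comparison, here is a minimal sketch of what I actually want: one source, one target, weighted by `distance`, stopping as soon as the target is reached. It is plain Python rather than Spark, and it assumes the edge list fits in driver memory (for example after `edges_rdd.collect()`), which may not hold for a really huge network. The function name `dijkstra_path` is my own placeholder, not part of PySpark or GraphFrames. Note that the `distances` column in the output above looks like hop counts rather than summed weights.

```python
import heapq
from collections import defaultdict


def dijkstra_path(edges, source, target):
    """Return (total_distance, [node, ...]) for the cheapest path.

    `edges` is an iterable of (src, dst, distance, streetclass) tuples,
    the same shape as the rows of edges_rdd above. Returns (inf, []) if
    the target cannot be reached from the source.
    """
    # Build an adjacency list: node -> [(neighbour, weight), ...]
    adjacency = defaultdict(list)
    for src, dst, distance, _streetclass in edges:
        adjacency[src].append((dst, distance))

    best = {source: 0.0}      # cheapest known distance per node
    previous = {}             # predecessor on the cheapest known path
    queue = [(0.0, source)]   # min-heap of (distance, node)
    visited = set()

    while queue:
        dist, node = heapq.heappop(queue)
        if node in visited:
            continue          # stale heap entry, already settled
        visited.add(node)
        if node == target:
            break             # early exit: no need to settle other nodes
        for neighbour, weight in adjacency[node]:
            candidate = dist + weight
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                previous[neighbour] = node
                heapq.heappush(queue, (candidate, neighbour))

    if target not in best:
        return float("inf"), []

    # Walk the predecessor links back from the target to the source.
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    path.reverse()
    return best[target], path


# Tiny made-up example in the same tuple format (not my real data).
sample_edges = [
    (0, 1, 4.0, '8'), (1, 0, 4.0, '8'),
    (0, 2, 1.0, '13'), (2, 0, 1.0, '13'),
    (2, 1, 2.0, '13'), (1, 2, 2.0, '13'),
    (1, 3, 5.0, '8'), (3, 1, 5.0, '8'),
]
print(dijkstra_path(sample_edges, 0, 3))  # (8.0, [0, 2, 1, 3])
```

With the RDD from above this would be called roughly as `dijkstra_path(edges_rdd.collect(), 0, some_target_id)`, where `some_target_id` stands for whatever vertex id is the destination. It gives up Spark's parallelism entirely, so it is only a fallback, not the distributed solution I am asking about.
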
JustSomeone