
I'm getting a NullPointerException at runtime when I map a function that calls ShortestPaths.run on a global graph variable. Computing the distances directly in the shell throws no error, and calling testF() on its own works as well, but it fails when testF() is called inside a map. When I remove the erroneous ShortestPaths call from testF, the example works fine. Does anyone know why this is happening?

import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.graphx.lib.ShortestPaths

val testG = Graph.fromEdges[Int, Int](sc.parallelize(List(Edge(1, 2, 1), Edge(2, 3, 1))), 0)
val testRDD = sc.parallelize(List(1, 2, 3, 4))
def testF() : Int = {
     // Runs a distributed GraphX job on the global graph
     val distances = ShortestPaths.run(testG, Seq(15134567L))
     return 5
}
testF() //works fine and returns 5
val testR = testRDD.map{case(num) => (num, testF())}
testR.take(10).foreach(println) //gives a null pointer error
mt88
  • I believe you can't access a distributed object (RDD, DataFrame, Graph, etc.) from inside a distributed map operation. I think you probably need to re-think your logic. If your data inside `testRDD` is small enough, you may consider collecting it into the driver and using broadcast – Daniel de Paula May 12 '16 at 01:29
  • Gotcha. That makes sense. So if I have a list of vertex pairs from a graph G and I want to compute the shortest paths between all these pairs, am I not able to? Is there a way around this? – mt88 May 12 '16 at 01:32
  • If the list is small enough, you may consider collecting it into the driver and using broadcast, or even passing it directly to `ShortestPaths` via the second argument – Daniel de Paula May 12 '16 at 01:33
  • Thanks for the help. My list isn't small, but the subgraphs are. – mt88 May 12 '16 at 01:35
  • So I'm afraid I can't think of a quick solution for your problem now. I would have to spend some time on it. – Daniel de Paula May 12 '16 at 01:37
  • Yeah, no sweat. This already cleared some stuff up. – mt88 May 12 '16 at 01:38
  • As a general rule, what you are trying to do by nesting an `RDD` function inside another `RDD` function (which you can't do, for the reasons Daniel gave) can often be accomplished with a `join` or possibly a `cogroup` – David Griffin May 12 '16 at 16:52
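
The `join` suggestion in the comments could look roughly like the sketch below. It assumes a spark-shell session where `sc` is available, and it assumes the set of distinct target vertices is small enough to collect on the driver, which may not hold for the asker's data. The `pairs` RDD and the names `landmarks`, `spResult`, and `pairDistances` are hypothetical. `ShortestPaths.run` is called once on the driver, and the result is joined back to the pairs instead of being computed inside a `map`.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.graphx.lib.ShortestPaths

// Same toy graph as in the question; `sc` comes from the spark-shell session.
val graph = Graph.fromEdges[Int, Int](
  sc.parallelize(List(Edge(1L, 2L, 1), Edge(2L, 3L, 1))), 0)

// Hypothetical (source, target) vertex pairs to look up.
val pairs = sc.parallelize(Seq[(VertexId, VertexId)]((1L, 3L), (2L, 3L), (3L, 1L)))

// Assumption: the distinct targets fit on the driver.
val landmarks: Seq[VertexId] = pairs.values.distinct().collect().toSeq

// One distributed job, launched from the driver. Each vertex ends up with a
// Map(landmark -> hop count) for the landmarks it can reach.
val spResult = ShortestPaths.run(graph, landmarks)

// Join on the source vertex id, then pick out the entry for the target.
// The distance is an Option because the target may be unreachable.
val pairDistances = pairs
  .join(spResult.vertices)
  .map { case (src, (dst, spMap)) => (src, dst, spMap.get(dst)) }

pairDistances.collect().foreach(println)
```

Every RDD and graph operation here is issued from the driver, so nothing distributed is referenced inside a closure that runs on the workers.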

1 Answer


As @DanieldePaula alluded to, you cannot nest distributed operations inside an RDD transformation. Instead, the logic within ShortestPaths.run would need to be extracted and reformulated as plain Scala code, without any use of sc (SparkContext) methods, Spark jobs, or any other driver-only mechanisms. You need to stick with serializable, worker-compatible logic.
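
As a minimal sketch of what "plain Scala" could mean here, the function below computes unweighted hop counts with a breadth-first search over an ordinary adjacency `Map`. It is not the GraphX implementation, and the names `hopDistances` and `adjacency` are hypothetical. It assumes the (sub)graph is small enough to hold in memory as a local collection, as the asker says the subgraphs are.

```scala
import scala.annotation.tailrec
import scala.collection.immutable.Queue

// Hypothetical helper: hop counts from `source` to every vertex reachable
// by following edge direction. Uses only local, serializable collections.
def hopDistances(adjacency: Map[Long, Seq[Long]], source: Long): Map[Long, Int] = {
  @tailrec
  def loop(frontier: Queue[Long], dist: Map[Long, Int]): Map[Long, Int] =
    frontier.dequeueOption match {
      case None => dist // frontier exhausted: every reachable vertex is labelled
      case Some((v, rest)) =>
        // Neighbours that have not been assigned a distance yet
        val unseen = adjacency.getOrElse(v, Seq.empty).distinct.filterNot(dist.contains)
        loop(rest ++ unseen, dist ++ unseen.map(_ -> (dist(v) + 1)))
    }
  loop(Queue(source), Map(source -> 0))
}

// The edges from the question (1 -> 2, 2 -> 3) as a local adjacency map
val adjacency: Map[Long, Seq[Long]] = Map(1L -> Seq(2L), 2L -> Seq(3L))
val fromOne = hopDistances(adjacency, 1L)
```

Because the function touches no RDD, Graph, or SparkContext, it could be called inside a `map`, for example after sharing the adjacency map with `sc.broadcast` and reading it through `.value` in the closure.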

WestCoastProjects