
I am currently using GraphFrames to retrieve connected components from a graph.

My code is very simple:

from graphframes import GraphFrame

v = sqlContext.createDataFrame(node, ["id", "name"])
print v.take(15)
e = sqlContext.createDataFrame(edge, ["src", "dst"])
print e.take(15)
g = GraphFrame(v, e)
# The NullPointerException comes from the connectedComponents call
res = g.connectedComponents()

Below is the output of the code snippet, which looks fine to me as well.

Print Vertices:

[Row(id=6, name=u'6'), Row(id=12, name=u'12'), Row(id=1, name=u'1'), Row(id=3, name=u'3'), Row(id=9, name=u'9'), Row(id=2, name=u'2'), Row(id=11, name=u'11'), Row(id=10, name=u'10'), Row(id=5, name=u'5'), Row(id=4, name=u'4')]

Print Edges:

[Row(src=2, dst=9), Row(src=2, dst=5), Row(src=2, dst=6), Row(src=9, dst=10), Row(src=11, dst=12), Row(src=4, dst=10), Row(src=1, dst=2), Row(src=1, dst=3), Row(src=1, dst=12)]
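As a sanity check (not part of the original run), the printed rows can be verified in plain Python to confirm that every edge endpoint exists in the vertex list, since a dangling edge reference is a common cause of graph-processing failures. The literals below are copied from the output above:

```python
# Vertex ids and edge pairs transcribed from the printed DataFrames.
vertex_ids = {6, 12, 1, 3, 9, 2, 11, 10, 5, 4}
edges = [(2, 9), (2, 5), (2, 6), (9, 10), (11, 12),
         (4, 10), (1, 2), (1, 3), (1, 12)]

# Collect any edge endpoint that is not a known vertex id.
dangling = [(s, d) for s, d in edges
            if s not in vertex_ids or d not in vertex_ids]
print(dangling)  # an empty list means vertices and edges are consistent
```

Here the list comes back empty, so the inputs themselves appear consistent.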

However, when g.connectedComponents() is executed, the program throws the following NullPointerException.

Would appreciate any suggestions on what's going wrong here!

ERROR LiveListenerBus: Listener JobProgressListener threw an exception
java.lang.NullPointerException
	at org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:361)
	at org.apache.spark.ui.jobs.JobProgressListener$$anonfun$onTaskEnd$1.apply(JobProgressListener.scala:360)
	at scala.collection.immutable.List.foreach(List.scala:318)
	at scala.collection.generic.TraversableForwarder$class.foreach(TraversableForwarder.scala:32)
	at scala.collection.mutable.ListBuffer.foreach(ListBuffer.scala:45)
	at org.apache.spark.ui.jobs.JobProgressListener.onTaskEnd(JobProgressListener.scala:360)
	at org.apache.spark.scheduler.SparkListenerBus$class.onPostEvent(SparkListenerBus.scala:42)
	at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
	at org.apache.spark.scheduler.LiveListenerBus.onPostEvent(LiveListenerBus.scala:31)
	at org.apache.spark.util.ListenerBus$class.postToAll(ListenerBus.scala:55)
	at org.apache.spark.util.AsynchronousListenerBus.postToAll(AsynchronousListenerBus.scala:37)
	at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(AsynchronousListenerBus.scala:80)
	at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
	at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(AsynchronousListenerBus.scala:65)
	at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
	at org.apache.spark.util.AsynchronousListenerBus$$anon$1$$anonfun$run$1.apply$mcV$sp(AsynchronousListenerBus.scala:64)
	at org.apache.spark.util.Utils$.tryOrStopSparkContext(Utils.scala:1183)
	at org.apache.spark.util.AsynchronousListenerBus$$anon$1.run(AsynchronousListenerBus.scala:63)

  • I'm not familiar with this specific error, but to make sure: did you set the checkPointDir? `spark.sparkContext.setCheckpointDir` – Tom Lous Jun 13 '17 at 10:20
  • The checkPointDir is not set; it seems that this is required for RDD.checkpoint to work. Do you mean that g.connectedComponents() involves some kind of checkpoint calls? I am quite new to Spark and GraphFrames — would you please explain a bit further why this might be needed? Thank you! – shu Jun 13 '17 at 12:30
  • Well, I guess the reason is that connected components in graphs are hard problems, and without checkpoints the RDD dependencies get too deep or require huge reshuffles, probably resulting in OOMs. Also, https://graphframes.github.io/user-guide.html#connected-components mentions it in the documentation – Tom Lous Jun 13 '17 at 12:35
  • Thanks for the explanation! I added sc.setCheckpointDir("checkpoints") in my code, but still can't get rid of the exception:(.. I noticed a new folder is created inside my "checkpoints" directory when the job is launched, but the folder is always empty. How can I get to know if the checkpoints have worked? – shu Jun 14 '17 at 02:49
  • I'm guessing the code doesn't even make it that far. Maybe it's unrelated to the checkpoint. My next guess would be something to do with the webui based on the error. Maybe disabling it helps? `.config("spark.ui.enabled", "false")` – Tom Lous Jun 14 '17 at 07:30
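Pulling the comment suggestions together, here is a minimal sketch of the configuration they describe. The builder-style session setup is an assumption (on Spark 1.x, where SparkSession does not exist, the same settings would go through SparkConf and SQLContext), and `node`/`edge` are the question's own input collections:

```python
from pyspark.sql import SparkSession
from graphframes import GraphFrame

# Disable the web UI, since the NPE originates in JobProgressListener
# (a UI listener), and the checkpoint directory is required by
# connectedComponents in recent GraphFrames versions.
spark = (SparkSession.builder
         .appName("cc-debug")
         .config("spark.ui.enabled", "false")
         .getOrCreate())
spark.sparkContext.setCheckpointDir("checkpoints")

v = spark.createDataFrame(node, ["id", "name"])  # node/edge as in the question
e = spark.createDataFrame(edge, ["src", "dst"])
res = GraphFrame(v, e).connectedComponents()
```

This is a configuration sketch, not a confirmed fix; it only encodes the two suggestions from the thread.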

0 Answers