This is the Connected Components example from GraphFrames:
from graphframes.examples import Graphs
g = Graphs(sqlContext).friends() # Get example graph
result = g.connectedComponents()
result.select("id", "component").orderBy("component").show()
The documentation says:
NOTE: With GraphFrames 0.3.0 and later releases, the default Connected Components algorithm requires setting a Spark checkpoint directory. Users can revert to the old algorithm using connectedComponents.setAlgorithm("graphx").
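To make the note concrete, here is a minimal sketch of the two options it describes: setting a checkpoint directory before running the default algorithm, or switching to the GraphX-based one. The helper names `make_checkpoint_dir` and `run_connected_components` are hypothetical and not part of GraphFrames. The sketch assumes a single-machine run where an absolute local directory is acceptable (on a cluster the directory would normally live on shared storage such as HDFS), and it assumes the Python API accepts the algorithm as the keyword argument `algorithm="graphx"`.

```python
import os
import tempfile


def make_checkpoint_dir(base=None):
    """Create (if needed) and return an absolute checkpoint directory path.

    Hypothetical helper: the directory name mirrors the one used in the
    question, and the temp-dir default is an assumption for local runs.
    """
    base = base or tempfile.gettempdir()
    path = os.path.join(os.path.abspath(base), "graphframes_cps")
    os.makedirs(path, exist_ok=True)
    return path


def run_connected_components(sc, sql_context, algorithm="graphframes"):
    """Run connected components on the GraphFrames example graph.

    Hypothetical helper. `sc` is a SparkContext and `sql_context` is the
    object passed to Graphs(...), as in the example above.
    """
    # Imported here so the path helper stays usable without GraphFrames installed.
    from graphframes.examples import Graphs

    if algorithm == "graphframes":
        # The default algorithm needs a checkpoint directory set beforehand.
        sc.setCheckpointDir(make_checkpoint_dir())

    g = Graphs(sql_context).friends()  # Get example graph
    # Assumption: algorithm="graphx" selects the older GraphX-based algorithm.
    return g.connectedComponents(algorithm=algorithm)
```

With a running Spark session, this would be called as `run_connected_components(sc, sqlContext)` for the default algorithm, or with `algorithm="graphx"` to skip checkpointing.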
So this is my full code, connected.py, with setCheckpointDir:
import pyspark
sc = pyspark.SparkContext().getOrCreate()
sc.addPyFile("/home/username/.ivy2/jars/graphframes_graphframes-0.8.1-spark3.0-s_2.12.jar")
from graphframes.examples import Graphs
sc.setCheckpointDir("graphframes_cps")
g = Graphs(sqlContext).friends() # Get example graph
result = g.connectedComponents()
result.select("id", "component").orderBy("component").show()
I run it with this command:
spark-submit connected.py --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12
Then it returns this error:
Traceback (most recent call last):
  File "/home/username//test/spark/connected.py", line 11, in <module>
    sc.setCheckpointDir("graphframes_cps")
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 975, in setCheckpointDir
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1304, in __call__
  File "/opt/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 326, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o19.setCheckpointDir.
How can I fix this?