result is just a normal DataFrame that you can group by component,
and then collect the IDs of each group into a list using the collect_list
function (doc). For example, using the example graph from graphframes:
from graphframes.examples import Graphs
import pyspark.sql.functions as F
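# connectedComponents needs a checkpoint directory; sc and sqlContext are
# assumed to be the ones predefined in the pyspark shell
sc.setCheckpointDir("/tmp/spark-checkpoint")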
g = Graphs(sqlContext).friends()
df = g.connectedComponents()
# getting the list of IDs per component
df2 = df.select("id", "component").groupBy("component") \
    .agg(F.collect_list("id"))
df2.show()
will give:
+------------+------------------+
|   component|  collect_list(id)|
+------------+------------------+
|412316860416|[a, b, c, d, e, f]|
+------------+------------------+
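If the result is small enough to bring back to the driver, the same grouping can be done in plain Python after a collect(). The following is only a minimal sketch: ids_by_component is a hypothetical helper, and sample_rows is made-up data standing in for the rows of df.select("id", "component").collect().

```python
from collections import defaultdict


def ids_by_component(rows):
    """Group (id, component) pairs into a {component: [ids]} dict.

    Hypothetical helper: `rows` can be any iterable of 2-item sequences,
    e.g. the Row objects from df.select("id", "component").collect(),
    since a Row unpacks like a tuple in column order.
    """
    groups = defaultdict(list)
    for vertex_id, component in rows:
        groups[component].append(vertex_id)
    return dict(groups)


# Made-up rows for illustration; not real connectedComponents output.
sample_rows = [("a", 1), ("b", 1), ("c", 2)]
print(ids_by_component(sample_rows))
```

With a real result, the call would be ids_by_component(df.select("id", "component").collect()). This only makes sense when the whole result fits in driver memory; otherwise keep the aggregation in Spark with collect_list as above.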