
I am learning PySpark in Python. If I use the line of code below to get the components of my graph, a column is added to the resulting DataFrame with the component ID (a seemingly random number). But I am curious: is it possible to get a list of the nodes that are connected to each other?

g.connectedComponents()
ffl

1 Answer


The result is just a normal DataFrame, so you can group it by component and collect the node IDs into a list using the collect_list function (doc). For example, using the example graph from graphframes:

from graphframes.examples import Graphs
import pyspark.sql.functions as F

# connectedComponents() requires a checkpoint directory to be set
sc.setCheckpointDir("/tmp/spark-checkpoint")

g = Graphs(sqlContext).friends()
df = g.connectedComponents()

# getting the list of IDs per component
df2 = df.select("id", "component").groupBy("component") \
  .agg(F.collect_list("id"))
df2.show()

will give:

+------------+------------------+
|   component|  collect_list(id)|
+------------+------------------+
|412316860416|[a, b, c, d, e, f]|
+------------+------------------+
Alex Ott
  • Is it possible to get a list of lists from the above GraphFrame? Something like [[a, b, c, d, e, f]] – ffl Apr 09 '22 at 23:40
  • Yes, you can call `.collect()` on the result of it. But it works only for small graphs – Alex Ott Apr 10 '22 at 08:51
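
Following up on the comments: a minimal sketch, reusing the df2 DataFrame built in the answer above, of how calling .collect() turns the grouped result into a plain Python list of lists. This is only practical for small graphs, since all rows are pulled to the driver:

# df2 has the columns "component" and "collect_list(id)" from the answer above
rows = df2.collect()
node_lists = [row["collect_list(id)"] for row in rows]
print(node_lists)  # e.g. [['a', 'b', 'c', 'd', 'e', 'f']] for the friends() example graph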