
I am learning PySpark in Python. If I use the line of code below to get the components of my graph, a column is added to the resulting DataFrame with the component ID (a seemingly random number). But I am curious: is it possible to get a list of the nodes that are connected to each other?

g.connectedComponents()
ffl

1 Answer


The result is just a normal DataFrame, so you can group it by component and collect the node IDs into a list using the collect_list function (doc). For example, using the example graph from graphframes:

from graphframes.examples import Graphs
import pyspark.sql.functions as F

# connectedComponents() requires a checkpoint directory to be set
sc.setCheckpointDir("/tmp/spark-checkpoint")

g = Graphs(sqlContext).friends()
df = g.connectedComponents()

# getting the list of IDs per component
df2 = df.select("id", "component").groupBy("component") \
  .agg(F.collect_list("id"))
df2.show()

will give:

+------------+------------------+
|   component|  collect_list(id)|
+------------+------------------+
|412316860416|[a, b, c, d, e, f]|
+------------+------------------+
Alex Ott
  • Is it possible to get a list of lists from the above GraphFrame? Something like [[a, b, c, d, e, f]] – ffl Apr 09 '22 at 23:40
  • Yes, you can call `.collect()` on the result of it. But it works only for small graphs – Alex Ott Apr 10 '22 at 08:51
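
Following up on the comments: a minimal sketch, reusing the df2 DataFrame built in the answer above, of how calling .collect() turns the grouped result into a plain Python list of lists. This is only practical for small graphs, since all rows are pulled to the driver:

# df2 has the columns "component" and "collect_list(id)" from the answer above
rows = df2.collect()
node_lists = [row["collect_list(id)"] for row in rows]
print(node_lists)  # e.g. [['a', 'b', 'c', 'd', 'e', 'f']] for the friends() example graph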