I am trying to run a lambda on each connected component of a graph in Spark GraphX. I can get the components with the connectedComponents() method, but the only approach I have found is to collect the distinct component labels (the vertex ids that connectedComponents() assigns to the components), iterate over them with foreach, and extract each component with subgraph(). This is a sequential process, so it does not scale when the graph has many small components. Is there a way to say something like connectedComponentsGraph.foreachComponent(lambda)?

1 Answer

I'd recommend using GraphFrames:

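 import org.apache.spark.graphx.Graph
 import org.graphframes._

 // Any GraphX graph; the attribute types here are placeholders.
 val graph: Graph[String, String] = ???
 val gdf = GraphFrame.fromGraphX(graph)
 // Use the GraphX-based implementation of connected components.
 val components = gdf.connectedComponents.setAlgorithm("graphx").run()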

and follow up with basic SQL:

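 // The $"..." column syntax needs `import spark.implicits._` (spark being the SparkSession).
 // Counts edges per component, attributing each edge to its source vertex.
 components
   .join(gdf.vertices, Seq("id"))
   .join(gdf.edges.select($"src" as "id"), Seq("id"))
   .groupBy("component")
   .count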
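
To run an arbitrary function per component rather than a count, one option is to key the vertices by their component label and group them. This also works in plain GraphX. The following is a minimal sketch under stated assumptions, not code from the answer: the toy graph, the object name `PerComponentSketch`, and the function `processComponent` are hypothetical, and it assumes every component is small enough to fit in memory on a single executor.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.sql.SparkSession

object PerComponentSketch {
  // Hypothetical per-component function; here it only counts the members.
  def processComponent(members: Iterable[(VertexId, String)]): Int = members.size

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("per-component-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    // Toy graph (an assumption) with two components: {1, 2, 3} and {4, 5}.
    val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c"), (4L, "d"), (5L, "e")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, ()), Edge(2L, 3L, ()), Edge(4L, 5L, ())))
    val graph = Graph(vertices, edges)

    // connectedComponents() labels each vertex with the lowest vertex id in its component.
    val labels = graph.connectedComponents().vertices // (vertexId, componentId)

    val perComponent = graph.vertices
      .join(labels)                                          // (id, (attr, componentId))
      .map { case (id, (attr, comp)) => (comp, (id, attr)) } // key by component label
      .groupByKey()                                          // one group per component
      .mapValues(processComponent)                           // applied on the executors

    perComponent.collect().sortBy(_._1).foreach(println)
    spark.stop()
  }
}
```

Because the grouping and the function application both happen inside Spark, components are processed in parallel instead of one subgraph() call at a time. If components can be large, an aggregation such as aggregateByKey would likely be a better fit than groupByKey.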
Alper t. Turker
  • Thank you so much! You gave me a clue, and I did basically the same thing without even switching to GraphFrames. Btw, why do you recommend it (apart from the fact that it's newer)? – Viacheslav Inozemtsev Jan 11 '18 at 08:31
  • Another question you might be able to help with: do you know how to specify the number of iterations for the connectedComponents() method? I have some synthetic tests, and they all require a different number of iterations. What could be the strategy here? – Viacheslav Inozemtsev Jan 11 '18 at 08:32