
I'm fairly new to Spark and GraphX, and I'm trying to understand how to perform the following operation using GraphX's Java APIs. I'm looking to write a method with the following signature:

private <VD, ED> List<Graph<VD, ED>> computeConnectedComponents(Graph<VD, ED> graph) {}

Given a graph in which every node has positive degree but the number of connected components is unknown, it should return a list (order doesn't matter) of graphs, each of which is connected.

I am aware of GraphOps.connectedComponents() and ConnectedComponents.run(), but I am struggling to understand the return values. The docs list them as returning a Graph<Object, ED> and say something about the "lowest vertex id" being returned.

Basically, I am wondering what I could do to derive this list of Graphs from the return value of connectedComponents and my initial graph.

Danimosity

1 Answer


The following code is in Scala, but it should demonstrate the idea.

The returned graph contains all the vertices, but each vertex's attribute is replaced with a VertexId (really just a Long), which can be interpreted as the id of the connected component that the vertex belongs to. It is also the "lowest vertex id" in that connected component.

import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD
val vertexArray = Array(
  (1L, ("A", 28)),
  (2L, ("B", 27)),
  (3L, ("C", 65)),
  (4L, ("D", 42)),
  (5L, ("E", 55)),
  (6L, ("F", 50)),
  (7L, ("G", 53)),
  (8L, ("H", 66))
  )

// Vertices 1 - 6 are connected, 7 and 8 are connected.
val edgeArray = Array(
  Edge(2L, 1L, 7),
  Edge(2L, 4L, 2),
  Edge(3L, 2L, 4),
  Edge(3L, 6L, 3),
  Edge(4L, 1L, 1),
  Edge(5L, 2L, 2),
  Edge(5L, 3L, 8),
  Edge(5L, 6L, 3),
  Edge(7L, 8L, 3)
  )

// Assumes `sc` is an existing SparkContext (e.g. the one spark-shell provides).
val vertexRDD: RDD[(Long, (String, Int))] = sc.parallelize(vertexArray)
val edgeRDD: RDD[Edge[Int]] = sc.parallelize(edgeArray)
val graph: Graph[(String, Int), Int] = Graph(vertexRDD, edgeRDD)

val cc = graph.connectedComponents().vertices.collectAsMap()
cc.foreach {
  case (vertexId, clusterId) =>
    println(s"Vertex $vertexId belongs to cluster $clusterId")
}

Output:

Vertex 8 belongs to cluster 7
Vertex 2 belongs to cluster 1
Vertex 5 belongs to cluster 1
Vertex 4 belongs to cluster 1
Vertex 7 belongs to cluster 7
Vertex 1 belongs to cluster 1
Vertex 3 belongs to cluster 1
Vertex 6 belongs to cluster 1
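
To get from these component ids to the list of graphs asked for in the question, one option is to attach each vertex's component id to the original graph and then take one subgraph per distinct id. The following is only a sketch: splitByComponent is a hypothetical helper name, and it assumes the number of components is small enough to collect on the driver. It uses the GraphX calls outerJoinVertices, subgraph and mapVertices.

```scala
import org.apache.spark.graphx._
import scala.reflect.ClassTag

// Hypothetical helper (not part of GraphX): returns one graph per connected component.
def splitByComponent[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): Seq[Graph[VD, ED]] = {
  // Vertex attribute here is the component id, i.e. the lowest vertex id in the component.
  val ccVertices = graph.connectedComponents().vertices

  // Keep the original attribute and pair it with the component id.
  val labelled: Graph[(VD, VertexId), ED] =
    graph.outerJoinVertices(ccVertices) { (id, attr, ccOpt) => (attr, ccOpt.getOrElse(id)) }
  labelled.cache()

  // Assumption: the set of distinct component ids fits in driver memory.
  val componentIds = ccVertices.map(_._2).distinct().collect().sorted

  componentIds.toSeq.map { cid =>
    labelled
      // Keep only vertices of this component; edges touching removed vertices are dropped.
      .subgraph(vpred = (_, attr) => attr._2 == cid)
      // Restore the original vertex attribute type.
      .mapVertices((_, attr) => attr._1)
  }
}
```

With the example graph above, splitByComponent(graph) should yield two graphs, one for vertices 1 to 6 and one for vertices 7 and 8. Each subgraph call is a separate filter over the whole graph, so for a very large number of components a different strategy (for example, grouping edges by component id) may be preferable.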
Dharman
memoryz