I have a large graph (a few million vertices and edges). I want to remove all vertices that have no outgoing edges, along with the edges pointing to them. I have code that works, but it is slow and I need to run it several times. I suspect an existing GraphX method could make this much faster.
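This is the code I have:
// Assumes sc is an existing SparkContext (e.g. the one provided by spark-shell)
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
val users: RDD[(VertexId, String)] = sc.parallelize(Array((1L, "1"), (2L, "2"), (3L, "3"), (4L, "4")))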
val relationships: RDD[Edge[Double]] = sc.parallelize(
Array(
Edge(1L, 3L, 500.0),
Edge(3L, 2L, 400.0),
Edge(2L, 1L, 600.0),
Edge(3L, 1L, 200.0),
Edge(2L, 4L, 200.0),
Edge(3L, 4L, 500.0)
))
val graph = org.apache.spark.graphx.Graph(users, relationships)
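// outDegrees only lists vertices with at least one outgoing edge; collect their IDs to the driver
val lst = graph.outDegrees.map(x => x._1).collect
val set: scala.collection.mutable.HashSet[Long] = new scala.collection.mutable.HashSet()
for (a <- lst) { set.add(a) }
// Keep only those vertices; edges touching a removed vertex are dropped as well
val subg = graph.subgraph(vpred = (id, attr) => set.contains(id))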
// Since vertex 4 has no outgoing edges, subg.edges.count should return 4 and subg.vertices.count should return 3
I don't know how else this can be achieved. Any help is appreciated!
EDIT: I got it working with a HashSet, but I think it can still be improved.
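One direction that might avoid collecting the IDs to the driver is to join the out-degrees onto the vertices and filter on that value, so everything stays distributed. The following is only a minimal sketch under that assumption, not a benchmarked solution: the object name SinkPruning and the helper dropSinks are hypothetical names introduced here, while outerJoinVertices, subgraph and mapVertices are standard GraphX operations.

```scala
import scala.reflect.ClassTag
import org.apache.spark.graphx.Graph

object SinkPruning {
  // Hypothetical helper: removes every vertex without outgoing edges,
  // together with the edges that point to it.
  def dropSinks[VD: ClassTag, ED: ClassTag](g: Graph[VD, ED]): Graph[VD, ED] = {
    // Attach each vertex's out-degree; vertices missing from outDegrees get 0.
    val withDegree = g.outerJoinVertices(g.outDegrees) {
      (_, attr, deg) => (attr, deg.getOrElse(0))
    }
    withDegree
      // Keep vertices with at least one outgoing edge; subgraph also drops
      // any edge whose source or destination was filtered out.
      .subgraph(vpred = (_, v) => v._2 > 0)
      // Restore the original vertex attribute.
      .mapVertices((_, v) => v._1)
  }
}
```

With the graph defined above, this would be called as SinkPruning.dropSinks(graph). Note that a single pass can leave vertices that have just lost their last outgoing edge, so repeated passes may still be needed depending on what the final graph should look like.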