
I have a large graph (a few million vertices and edges). I want to remove all vertices (and their edges) that have no outgoing edges. I have some code that works, but it is slow, and I need to run it several times. I am sure some existing GraphX method could make it much faster.

This is the code I have.

val users: RDD[(VertexId, String)] = sc.parallelize(Array((1L, "1"), (2L, "2"), (3L, "3"), (4L, "4")))
val relationships: RDD[Edge[Double]] = sc.parallelize(
  Array(
    Edge(1L, 3L, 500.0),
    Edge(3L, 2L, 400.0),
    Edge(2L, 1L, 600.0),
    Edge(3L, 1L, 200.0),
    Edge(2L, 4L, 200.0),
    Edge(3L, 4L, 500.0)
  ))

val graph = org.apache.spark.graphx.Graph(users, relationships)

val lst = graph.outDegrees.map(x => x._1).collect
val set = new scala.collection.mutable.HashSet[Long]()
for (a <- lst) { set.add(a) }
val subg = graph.subgraph(vpred = (id, attr) => set.contains(id))
// since vertex 4 has no outgoing edges, subg.edges has 4 edges and subg.vertices has 3 vertices

I don't know how else this can be achieved. Any help is appreciated!

EDIT: I could do it with a HashSet, but I think it can still be improved.

Mann

4 Answers


You could directly define another graph with the filtered vertices. Something like this:

val lst = graph.outDegrees.map(x => x._1).collect.toSet
// filter both vertices and edges: Graph() would otherwise re-create
// missing endpoints (e.g. vertex 4) with a default attribute
val graph2 = Graph(
  graph.vertices.filter(v => lst.contains(v._1)),
  graph.edges.filter(e => lst.contains(e.srcId) && lst.contains(e.dstId)))
fingerprints
0


A first optimization to your code is to make lst a set rather than an array, which makes each lookup O(1) rather than O(n).

But this is not scalable, since you are collecting everything on the driver and then sending it back to the executors. The right way is to join outDegrees back into the graph (with joinVertices or outerJoinVertices) and filter on the joined degree, so everything stays distributed.
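For reference, here is a minimal sketch of that idea. It uses outerJoinVertices rather than joinVertices, because the join has to supply a zero degree for vertices that are missing from outDegrees (joinVertices cannot change the attribute type, and vertices absent from the join keep their old attribute). It assumes a spark-shell session where sc is available, and rebuilds the example graph from the question:

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// the example graph from the question
val users: RDD[(VertexId, String)] =
  sc.parallelize(Array((1L, "1"), (2L, "2"), (3L, "3"), (4L, "4")))
val relationships: RDD[Edge[Double]] = sc.parallelize(Array(
  Edge(1L, 3L, 500.0), Edge(3L, 2L, 400.0), Edge(2L, 1L, 600.0),
  Edge(3L, 1L, 200.0), Edge(2L, 4L, 200.0), Edge(3L, 4L, 500.0)))
val graph = Graph(users, relationships)

// attach each vertex's out-degree (0 for vertices absent from outDegrees),
// keep only vertices with at least one outgoing edge, then drop the degree
val pruned = graph
  .outerJoinVertices(graph.outDegrees) { (_, attr, deg) => (attr, deg.getOrElse(0)) }
  .subgraph(vpred = (_, v) => v._2 > 0)
  .mapVertices((_, v) => v._1)

pruned.vertices.count  // 3 (vertex 4 is gone)
pruned.edges.count     // 4 (both edges into vertex 4 are gone)
```

Nothing is collected to the driver here; the degree join, the predicate, and the attribute cleanup all run on the executors.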

Ulysse Mizrahi
  • Thanks for the input. I tried doing it with a HashSet but it didn't work. Can you give an example of how I can achieve it with joinVertices? Btw, joinVertices is not an inner join. – Mann May 15 '18 at 10:22
0


If you do not want to use subgraph, here is another way using triplets to find those destination vertices which are also source vertices.

val graph = org.apache.spark.graphx.Graph(users, relationships)
// key every triplet once by its source vertex and once by its destination vertex
val asSubjects = graph.triplets.map(triplet => (triplet.srcId, triplet))
val asObjects = graph.triplets.map(triplet => (triplet.dstId, triplet))
// a destination vertex that also appears as a source has at least one outgoing edge
val objectsJoinSubjects = asObjects.join(asSubjects)
val objectsJoinSubjectsDistinct = objectsJoinSubjects.mapValues(x => x._1).distinct()
val newVertices = objectsJoinSubjectsDistinct.map(x => (x._2.srcId, x._2.srcAttr)).distinct()
val newEdges = objectsJoinSubjectsDistinct.map(x => Edge(x._2.srcId, x._2.dstId, x._2.attr))
val newgraph = Graph(newVertices, newEdges)

I am not sure whether this is an improvement over subgraph, because my solution uses distinct(), which is expensive. I tested with the graph you provided, and my solution actually takes longer. However, that is a small example, so I would suggest testing with a larger graph and letting us know if this works better.

0


You could use this to find all the zero out-degree vertices.

val zeroOutDeg = graph.filter(
  g => {
    val degrees: VertexRDD[Int] = g.outDegrees
    g.outerJoinVertices(degrees) { (vid, data, deg) => deg.getOrElse(0) }
  },
  vpred = (vid: VertexId, deg: Int) => deg == 0)
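Since the question asks to remove those vertices rather than find them, note that flipping the predicate to deg > 0 in the same filter call should give the pruned graph directly; filter applies the predicate to the preprocessed (degree-annotated) graph but returns a subgraph of the original, so the original vertex attributes are preserved. A sketch, assuming a spark-shell session with the question's example graph bound to graph:

```scala
import org.apache.spark.graphx.{VertexId, VertexRDD}

// keep only vertices that have at least one outgoing edge;
// filter masks the original graph, so attributes stay unchanged
val withOutEdges = graph.filter(
  g => {
    val degrees: VertexRDD[Int] = g.outDegrees
    g.outerJoinVertices(degrees) { (vid, data, deg) => deg.getOrElse(0) }
  },
  vpred = (vid: VertexId, deg: Int) => deg > 0)

withOutEdges.vertices.count  // 3 (vertex 4 removed)
withOutEdges.edges.count     // 4 (edges into vertex 4 removed)
```

Edges pointing at a removed vertex are dropped automatically, because an edge survives only if both of its endpoints pass the vertex predicate.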