
We have an assignment to find shortest paths using the GraphX Pregel API on a graph with 3 lakh (300,000) vertices. We are supposed to use each vertex as the source vertex once and identify the shortest path among all these executions. My code looks like the following:

import org.apache.spark.SparkContext
import org.apache.spark.graphx._

def shortestPath(sc: SparkContext, mainGraph: Graph[(String, String, Double), Double], singleSourceVertexFlag: Boolean) {

// If the single-source flag is set, run only one iteration;
// otherwise run one iteration per vertex in the graph.
var noOfIterations = mainGraph.vertices.count()
if (singleSourceVertexFlag) {
  noOfIterations = 1
}

for (i <- 0 until noOfIterations.toInt) {
  val sourceId: VertexId = i
  // Initialise distances: 0 for the source vertex, infinity for all others
  val modGraph = mainGraph.mapVertices((id, attr) =>
    if (id == sourceId) 0.0
    else Double.PositiveInfinity)

  // Upper bound on Pregel supersteps: at most |V| iterations
  val loopItrCount = modGraph.vertices.count().toInt
  val sssp = modGraph.pregel(Double.PositiveInfinity, loopItrCount, EdgeDirection.Out)(

    (id, dist, newDist) =>
      if (dist < newDist) dist
      else newDist, // Vertex Program

    triplet => { // Send Message
      if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
        Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
      } else {
        Iterator.empty
      }
    },

    (a, b) =>
      if (a < b) a // Merge Message
      else b)

  // Release the cached RDDs of this iteration before the next one starts
  sssp.unpersist(blocking = true)
  modGraph.unpersist(blocking = true)

  println("**** Shortest Path End **** sourceId: " + sourceId)

}

}

From this code I have to read the shortest path computed in each loop iteration and, from those, identify the minimum value as the final output (this is the future part; I am yet to code it).
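What I have in mind for that part looks roughly like the following (only a sketch, not written or tested yet; `globalMin`, `reachable` and `minForThisSource` are placeholder names of mine, and the middle part would have to sit inside the loop before `sssp.unpersist(true)` is called):

// Sketch only (not coded/tested yet)
var globalMin = Double.PositiveInfinity   // declared once, before the for loop

// ... inside the loop, before sssp.unpersist(true) ...
// Drop the source itself and any vertex that was never reached
val reachable = sssp.vertices
  .filter { case (id, dist) => id != sourceId && dist < Double.PositiveInfinity }

if (!reachable.isEmpty()) {
  val minForThisSource = reachable.map { case (_, dist) => dist }.min()
  if (minForThisSource < globalMin) globalMin = minForThisSource
}

// after the loop, globalMin would be the final output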

This current code works fine for a 15-node graph and also for a 1112-node graph. But when I try to execute the algorithm on a 22k-node graph, it runs for 55 source nodes and then stops with an out of memory error. We have a two-node cluster (node 1: 64 GB RAM, node 2: 32 GB RAM).

My questions are:
1. How are for loops treated on a Spark cluster? Is there anything I have to modify in the code so that it is optimized?
2. I am using unpersist so that the RDDs of each loop iteration are cleared and new ones are created for the next iteration. But I still get an out of memory error after it executes for 55 source nodes. What should be done so that it runs for all the nodes? (The cleanup pattern I have in mind is sketched below.)
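For reference, the end-of-iteration cleanup I am attempting looks roughly like this (a simplified sketch using the names from the code above; whether forcing the result with an action before unpersisting is enough to actually release memory is exactly what I am unsure about):

// Simplified end-of-iteration cleanup (untested at 22k scale)
sssp.vertices.count()                  // action: make sure the result is materialised
// ... read the per-source minimum here, before unpersisting ...

sssp.unpersist(blocking = true)        // drop the cached result graph
modGraph.unpersist(blocking = true)    // drop the per-source initialised graph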

Sarala Hegde
  • Did you solve the above problem, @Sarala Hegde? – Yasir Arfat Dec 17 '16 at 11:37
  • @Aroon, later on we used AWS clusters to execute it. On AWS we could run it better than on the two-node cluster, but we still could not run it completely for 3 lakh nodes; the algorithm was taking too much time. – Sarala Hegde Feb 08 '17 at 09:36

0 Answers