We have an assignment to find the shortest path using the Pregel API on a graph of 3 lakh (300,000) vertices. We are supposed to make each vertex the source vertex once and identify the shortest path among all these executions. My code looks like this:
import org.apache.spark.SparkContext
import org.apache.spark.graphx._

def shortestPath(sc: SparkContext, mainGraph: Graph[(String, String, Double), Double], singleSourceVertexFlag: Boolean): Unit = {
  // If singleSourceVertexFlag is set, run only one iteration;
  // otherwise loop through the complete list of vertices, one run per source
  val noOfIterations =
    if (singleSourceVertexFlag) 1L
    else mainGraph.vertices.count()

  for (i <- 0 until noOfIterations.toInt) {
    val sourceId: VertexId = i
    // Initialise the source vertex to 0.0 and every other vertex to infinity
    val modGraph = mainGraph.mapVertices((id, _) =>
      if (id == sourceId) 0.0
      else Double.PositiveInfinity)

    val loopItrCount = modGraph.vertices.count().toInt
    val sssp = modGraph.pregel(Double.PositiveInfinity, loopItrCount, EdgeDirection.Out)(
      // Vertex program: keep the smaller of the current and incoming distance
      (id, dist, newDist) => math.min(dist, newDist),
      // Send message: propagate a shorter path to the destination if one exists
      triplet =>
        if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
          Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
        else
          Iterator.empty,
      // Merge message: keep the minimum of two candidate distances
      (a, b) => math.min(a, b))

    sssp.unpersist(blocking = true)
    modGraph.unpersist(blocking = true)
    println("****Shortest Path End**** SourceId: " + sourceId)
  }
}
From this code I have to read the shortest path from each loop iteration and, across all of them, identify the minimum value as the final output (this part is still to be coded).
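The "read the result of each run and keep the minimum" step could be sketched as a small helper like the one below. This is only my sketch of that future part; the name `minFiniteDistance`, and the choice to exclude the source itself (distance 0.0) and unreachable vertices (infinity), are assumptions, not something the assignment specifies:

```scala
// Hypothetical helper for the yet-to-be-written aggregation step.
// Takes the (vertexId, distance) pairs produced by one Pregel run
// (e.g. sssp.vertices.collect()) and returns the smallest finite,
// non-source distance, or None if no other vertex is reachable.
def minFiniteDistance(distances: Seq[(Long, Double)]): Option[Double] = {
  val finite = distances.collect {
    case (_, d) if d > 0.0 && !d.isInfinity => d
  }
  if (finite.isEmpty) None else Some(finite.min)
}

// Inside the loop it could be applied per source, keeping a running minimum:
//   val perSource = minFiniteDistance(sssp.vertices.collect().toSeq)
//   perSource.foreach(d => globalMin = math.min(globalMin, d))
```

Note that this read would have to happen before the `sssp.unpersist(true)` call, since unpersisting first throws away the cached result.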
This code works fine for a 15-node graph and also for a 1,112-node graph. But when I run the algorithm on a 22k-node graph, it executes for 55 source nodes and then stops with an out-of-memory error. We have a two-node cluster (node 1: 64 GB RAM, node 2: 32 GB RAM).
My questions are:
1. How are for loops treated on a Spark cluster? Is there anything in the code I should modify so that it is optimized?
2. I am calling unpersist so that the RDDs from each iteration are cleared and new ones are created for the next. Still, I get an out-of-memory error after it executes for 55 source nodes. What should be done so it runs for all the nodes?