
I have a function that takes the neighbors of a node (for the neighbors I use a broadcast variable) and the id of the node itself, and it calculates the closeness centrality for that node. I map each vertex of the graph to the result of that function. When I open the task manager, the CPU is not utilized at all, as if the job were not running in parallel; the same goes for memory. Yet the function is supposed to run for every node in parallel, the data is large, and the job takes a long time to complete, so it is not as if it does not need the resources. Any help is truly appreciated, thank you. For loading the graph I use `val graph = GraphLoader.edgeListFile(sc, path).cache`.

import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Graph, VertexId}

import scala.collection.Map
import scala.collection.mutable
import scala.collection.mutable.ListBuffer

// CollectNeighbors and FibonacciHeap are project-specific helpers used below.
object ClosenessCentrality {

  case class Vertex(id: VertexId)

  def run(graph: Graph[Int, Float], sc: SparkContext): Unit = {
    // Have to reverse edges and make the graph undirected because it is bipartite.
    // Collect the weighted adjacency map on the driver and broadcast it to every executor.
    val neighbors = CollectNeighbors.collectWeightedNeighbors(graph).collectAsMap()
    val bNeighbors = sc.broadcast(neighbors)

    // Run a single-source shortest-path computation from every vertex;
    // count() only forces the lazy RDD to be evaluated.
    val result = graph.vertices.map(f => shortestPaths(f._1, bNeighbors.value))
    //result.coalesce(1)
    result.count()
  }

  /** Dijkstra from `source` over the broadcast adjacency map. Returns the sum of the
    * shortest-path distances (the "farness"), from which closeness is derived. */
  def shortestPaths(source: VertexId, neighbors: Map[VertexId, Map[VertexId, Float]]): Double = {
    val predecessors = new mutable.HashMap[VertexId, ListBuffer[VertexId]]()
    val distances = new mutable.HashMap[VertexId, Double]()
    val q = new FibonacciHeap[Vertex]
    val nodes = new mutable.HashMap[VertexId, FibonacciHeap.Node[Vertex]]()

    distances.put(source, 0)

    // Give every other vertex an "infinite" tentative distance and push it onto the heap.
    for ((v, _) <- neighbors) {
      if (v != source)
        distances.put(v, Int.MaxValue)

      predecessors.put(v, ListBuffer[VertexId]())
      val node = q.insert(Vertex(v), distances(v))
      nodes.put(v, node)
    }

    while (!q.isEmpty) {
      // Extract the vertex with the smallest tentative distance.
      val u = q.minNode
      val node = u.data.id
      q.removeMin()

      // Relax every edge leaving the extracted vertex.
      for (w <- neighbors(node).keys) {
        val alt = distances(node) + neighbors(node)(w)
        // A strictly shorter path to w was found: record it and decrease its key in the heap.
        if (alt < distances(w)) {
          distances(w) = alt
          predecessors(w) += node
          q.decreaseKey(nodes(w), alt)
        }
      }
    }
    // Sum of the shortest-path distances from `source` to every other vertex.
    distances.values.sum
  }
}
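
For context, a driver-side sketch of how this might be invoked, assuming the loading line from the question; the constant edge weight and the conversion step are assumptions (edgeListFile yields a Graph[Int, Int], while run expects a Graph[Int, Float]):

// Hypothetical driver code: load the edge list as in the question and give every
// edge a Float weight of 1.0 purely for illustration.
val graph = GraphLoader.edgeListFile(sc, path).cache()
val weighted: Graph[Int, Float] = graph.mapEdges(_ => 1.0f)
ClosenessCentrality.run(weighted, sc)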
user3224454
  • Did you launch your program on a cluster or locally? If it is a local master, did you specify the number of cores to use, like this: `--master=local[8]`? Alternatively, how many partitions does your dataset have? If it only has a single partition, then a single core is used. – A.Perrot Jan 30 '17 at 16:10
  • Yes, I did; with other programs it uses a lot more resources. For the partitions I left the default from loading the graph from an edge list file, but I had thought of that and used coalesce with 10 for the 8 cores that I have. Should I use more, or am I doing this wrong? – user3224454 Jan 30 '17 at 16:13
  • Can you provide some code ? – A.Perrot Jan 30 '17 at 16:16
  • Thanks for the code. It is weird that you do not use any methods from GraphX to process the graph. Something like computing the shortest path should be done using the Pregel API. Also, in your first instruction, you seem to be collecting the entire adjacency of the graph onto the driver. This is a big "code smell" for Spark. I know this does not really answer the question, but you might want to change your approach to the problem to better use the distributed nature of Spark. – A.Perrot Jan 30 '17 at 16:26
  • GraphX provides the Pregel API to compute the SSSP for a node of the graph, which is the same thing for me, but you cannot do that in parallel for every node; it would be like using the whole graph for every node in parallel. I have not thought of anything better yet. At first I thought the same, because collecting all the neighbors is almost as big as the graph itself, which is not good, but I do not know how to do it in parallel. – user3224454 Jan 30 '17 at 16:32
  • The Pregel API is your friend. But having an all-nodes-to-all-nodes shortest path on a big graph (O(n^2) memory) might not be a good idea. Anyway, you have an algorithmic problem that is not relevant to this question. We can help you with that, just ask the right question :) – A.Perrot Jan 30 '17 at 16:46
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/134397/discussion-between-user3224454-and-a-perrot). – user3224454 Jan 30 '17 at 17:06
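
For reference, the Pregel API mentioned in the comments above is typically used for single-source shortest paths along the lines of the standard GraphX example. A minimal sketch follows, where `sourceId` is a placeholder and `graph` is the weighted Graph[Int, Float] from the question; note that this still computes one source at a time, which is exactly the limitation discussed:

// Standard GraphX Pregel SSSP pattern (sketch): start every vertex at infinity
// except the source, then propagate shorter distances until no messages remain.
val sourceId: VertexId = 0L // placeholder source vertex
val initialGraph = graph.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)
val sssp = initialGraph.pregel(Double.PositiveInfinity)(
  (id, dist, newDist) => math.min(dist, newDist),   // vertex program
  triplet => {                                       // send shorter distances along edges
    if (triplet.srcAttr + triplet.attr < triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
    else
      Iterator.empty
  },
  (a, b) => math.min(a, b)                           // merge incoming messages
)
// sssp.vertices now holds the distance from sourceId to every reachable vertex.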

1 Answer


To provide somewhat of an answer to your original question: I suspect that your RDD only has a single partition, and thus only a single core is used for processing.

The edgeListFile method has an argument to specify the minimum number of partitions you want. Also, you can use repartition to get more partitions.
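
For example, a rough sketch reusing the question's variables (the partition count of 16 is a placeholder, not a recommendation):

// Ask for more edge partitions up front, e.g. a small multiple of your core count.
val graph = GraphLoader.edgeListFile(sc, path, numEdgePartitions = 16).cache()
// Check how the vertices are actually split across partitions.
println(graph.vertices.getNumPartitions)
// Or repartition the vertex RDD just before the expensive map.
val result = graph.vertices
  .repartition(16)
  .map(f => shortestPaths(f._1, bNeighbors.value))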

You mentioned coalesce, but by default that only reduces the number of partitions; see this question: Spark Coalesce More Partitions
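
If you do reach for coalesce, it can only increase the partition count when shuffling is enabled; a minimal sketch:

// Without shuffle = true, coalesce can only merge partitions, never add any.
val morePartitions = graph.vertices.coalesce(16, shuffle = true)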

A.Perrot
  • Thank you, now it works. But how do I choose the number of partitions based on the number of CPU cores? – user3224454 Jan 30 '17 at 16:51
  • At minimum, you should have as many partitions as cores. But I highly recommend having more than that, to ensure that each partition is small enough (especially if you want every vertex to keep track of the entire graph). My advice is: test, test and test some more, and see if you can find a sweet spot. – A.Perrot Jan 30 '17 at 16:59
  • Just as a side note to that: decreasing the size of each partition (thus increasing the _number_ of partitions) doesn't seem to be the right way in 100% of cases. The overhead per partition is simply too large to be ignored, so make sure you keep a balance between the two. The test (and test more) advice is good enough for a small/medium-sized job anyway. – dennlinger Mar 15 '17 at 11:30