
Let's say I have a graph with Double values as edge attributes and I want to find the maximum edge weight of my graph. If I do this:

val max = sc.accumulator(0.0) // max holds the maximum edge weight
g.edges.distinct.collect.foreach { e => if (e.attr > max.value) max.value = e.attr }

How much of this work is done on the master and how much on the executors? I know that the collect() method brings the entire RDD to the master, so does any parallelism happen at all? Is there a better way to find the maximum edge weight?

NOTE:

// Does not work without the collect() method.
// I use an accumulator because I want to use the max edge weight later.
g.edges.distinct.foreach { e => if (e.attr > max.value) max.value = e.attr }

Also, if I want to apply some averaging function to the attributes of edges that have the same srcId and dstId in two different graphs, what is the best way to do it?

Al Jenssen

1 Answer


You can either aggregate:

graph.edges.aggregate(Double.NegativeInfinity)(
  (m, e) => e.attr.max(m),
  (m1, m2) => m1.max(m2)
)

or map and take max:

 graph.edges.map(_.attr).max
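
To see where the parallelism comes from, here is a minimal sketch of how the two functions passed to aggregate divide the work. It uses plain Scala collections rather than Spark: `partitions` is a hypothetical stand-in for the partitions of `graph.edges`, and the weights are made-up values. In Spark the first function would run inside each task on the executors, and only the per-partition results would be merged with the second function.

```scala
object AggregateSketch {
  // Hypothetical stand-in for an RDD's partitions: each inner Seq plays the
  // role of the edge weights processed by one task on an executor.
  val partitions: Seq[Seq[Double]] = Seq(
    Seq(1.5, 4.0),
    Seq(7.5, 2.0),
    Seq.empty // an empty partition just yields the zero value
  )

  // Same zero value and functions as in the aggregate call above,
  // except that the elements here are bare weights instead of edges.
  val zero: Double = Double.NegativeInfinity
  val seqOp: (Double, Double) => Double = (m, w) => w.max(m)
  val combOp: (Double, Double) => Double = (m1, m2) => m1.max(m2)

  // Step 1: fold every partition independently (the parallel part).
  val partials: Seq[Double] = partitions.map(_.foldLeft(zero)(seqOp))

  // Step 2: merge the small per-partition results into one value.
  val maxWeight: Double = partials.foldLeft(zero)(combOp)
}
```

Only one Double per partition has to be merged at the end, which is why this scales better than collecting every edge first.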

Regarding your attempts:

  1. If you collect, all the data is processed sequentially on the driver, so there is no reason to use an accumulator.
  2. It doesn't work because accumulators are write-only from a worker's perspective: tasks can only add to them, and the value can be read only on the driver.
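
Since the maximum comes back to the driver as an ordinary Double, you can reuse it later without any accumulator. A rough sketch, assuming a local SparkContext and a small made-up graph; `MaxEdgeWeight`, `maxEdgeWeight`, and the normalization step are hypothetical names and usage chosen only for illustration:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

object MaxEdgeWeight {
  // The reduction runs on the executors; only a single Double
  // is returned to the driver.
  def maxEdgeWeight[VD](graph: Graph[VD, Double]): Double =
    graph.edges.map(_.attr).max()

  def main(args: Array[String]): Unit = {
    // Assumption: local mode, just to make the sketch runnable.
    val conf = new SparkConf().setAppName("max-edge-weight").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up example edges with Double weights.
      val edges = sc.parallelize(Seq(
        Edge(1L, 2L, 1.5),
        Edge(2L, 3L, 7.5),
        Edge(3L, 1L, 4.0)
      ))
      val graph = Graph.fromEdges(edges, defaultValue = 0)

      val maxWeight = maxEdgeWeight(graph)
      println(s"max edge weight = $maxWeight")

      // The plain Double is captured by the closure and shipped to the
      // executors, e.g. to rescale the weights (assumes maxWeight > 0).
      val normalized = graph.mapEdges(e => e.attr / maxWeight)
      normalized.edges.collect().foreach(println)
    } finally {
      sc.stop()
    }
  }
}
```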
zero323