I'm looking for a trick to force Spark to perform the reduce operation locally, across all the tasks executed by a worker's cores, before performing it globally across all tasks. Indeed, my driver node and the network bandwidth seem to be overloaded because the task results are large (about 400 MB each).
// arg0 and arg1 are broadcast once per worker; arg2 is small and shipped with the task closure
val arg0 = sc.broadcast(fs.read(0, 4))
val arg1 = sc.broadcast(fs.read(1, 4))
val arg2 = fs.read(5, 4)
val index = sc.parallelize(0L until 10000L)
val mapres = index.map { x => function(arg0.value, arg1.value, x, arg2) }
val output = mapres.reduce(Util.bitor) // each task result is ~400 MB
The driver distributes one partition per processor core, so eight partitions per worker.
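For reference, the closest built-in I have found is RDD.treeReduce, which inserts intermediate aggregation levels on the executors so the driver receives only a handful of partial results instead of one ~400 MB result per task. A minimal sketch against the code above (the depth of 2 is just a starting value to tune):

// treeReduce combines partial results on the executors in extra stages
// before the final few values are sent to the driver
val output = mapres.treeReduce(Util.bitor, depth = 2)

As far as I can tell, though, treeReduce groups partials by partition index rather than strictly by worker, so it does not guarantee the per-worker locality I am after. The obvious alternative, mapres.coalesce(numberOfWorkers, shuffle = false).reduce(Util.bitor) (where numberOfWorkers is a placeholder for the actual worker count), seems to collapse the map stage down to one task per worker as well, which loses parallelism during the computation itself. Hence the question.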