
Assume a problem where I have an RDD X. I calculate the mean m on a single worker node and then want to compute X - m, e.g. to calculate standard deviations. I want this to happen in the cluster, not on the driver node, i.e. I want m to be distributed. I thought of implementing it as a cartesian product of those two RDDs, so that as soon as m is calculated it propagates to all workers and they compute X - m. My fear is that Spark will shuffle the partitions of X to wherever m lives and do the subtraction there. Is there any guarantee about which side gets shuffled in the case of X.cartesian(m)?
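To make the question concrete, here is a minimal sketch in plain Python (standing in for Spark; the partition list and values are hypothetical) of the intended data flow: reduce tiny per-partition results to compute m, then ship only the scalar m back to every partition, which is what a Spark broadcast variable would do, so X itself never moves:

```python
# Conceptual sketch -- plain Python standing in for a partitioned RDD.
# `partitions` is a hypothetical partitioning of X across worker nodes.
from functools import reduce

partitions = [[1.0, 2.0], [3.0, 4.0], [5.0]]

# Stage 1: each partition reduces locally; only the small (count, sum)
# pairs travel, and the mean m is combined from them.
partial = [(len(p), sum(p)) for p in partitions]
n, s = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), partial)
m = s / n  # -> 3.0

# Stage 2: propagate the scalar m to every partition (the role a
# broadcast variable plays in Spark) and subtract where X already lives.
centered = [[x - m for x in p] for p in partitions]
```

The point of the sketch is that only m crosses the network in stage 2; whether `cartesian` gives the same guarantee is exactly what the question asks.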

The mean/stddev problem above is for illustration purposes - I know it's not a great example, but it's simple enough.
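For this particular illustration there is also a one-pass formulation that sidesteps distributing m entirely: aggregate (count, sum, sum of squares) in a single reduce and derive mean and stddev from those three numbers. A sketch, again with plain Python's `reduce` standing in for `RDD.aggregate`:

```python
import math
from functools import reduce

X = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical contents of the RDD

# Single pass: fold each element into a (count, sum, sum-of-squares)
# accumulator; in Spark these accumulators also combine across partitions.
count, total, sumsq = reduce(
    lambda acc, x: (acc[0] + 1, acc[1] + x, acc[2] + x * x),
    X, (0, 0.0, 0.0))

mean = total / count
stdev = math.sqrt(sumsq / count - mean * mean)  # population stddev
```

This avoids the second pass over X and the whole question of where m lives, at the cost of the numerically less stable sum-of-squares formula.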

neverlastn
  • What is X-m? The mean of X? Why do you calculate the mean m on a single worker node first? – tsiki Jul 11 '15 at 18:03
  • Thanks @tsiki. The mean is one of the metrics I calculate, and it's reduceByKey'd to one of the worker nodes. X-m is the vector X with m subtracted from every one of its items. This is part of a streaming application. – neverlastn Jul 11 '15 at 19:15

0 Answers