
Assume a problem where I have an RDD X. I calculate the mean m on a single worker node and then want to compute X - m, e.g. to calculate standard deviations. I want this to happen in the cluster, not on the driver node, i.e. I want m to be distributed. I thought of implementing it as a cartesian product of those two RDDs, so that as soon as m is calculated it propagates to all workers and they compute X - m. My fear is that Spark will shuffle the partitions of X to wherever m lives and do the subtraction there. Is there any guarantee about which side gets shuffled in the case of X.cartesian(m)?
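To make the question concrete, here is a minimal sketch in plain Python (standing in for Spark; the partition list and values are hypothetical) of the intended data flow: reduce tiny per-partition results to compute m, then ship only the scalar m back to every partition, which is what a Spark broadcast variable would do, so X itself never moves:

```python
# Conceptual sketch -- plain Python standing in for a partitioned RDD.
# `partitions` is a hypothetical partitioning of X across worker nodes.
from functools import reduce

partitions = [[1.0, 2.0], [3.0, 4.0], [5.0]]

# Stage 1: each partition reduces locally; only the small (count, sum)
# pairs travel, and the mean m is combined from them.
partial = [(len(p), sum(p)) for p in partitions]
n, s = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), partial)
m = s / n  # -> 3.0

# Stage 2: propagate the scalar m to every partition (the role a
# broadcast variable plays in Spark) and subtract where X already lives.
centered = [[x - m for x in p] for p in partitions]
```

The point of the sketch is that only m crosses the network in stage 2; whether `cartesian` gives the same guarantee is exactly what the question asks.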

The mean/stddev problem above is for illustration purposes - I know it's not a great example, but it's simple enough.
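For this particular illustration there is also a one-pass formulation that sidesteps distributing m entirely: aggregate (count, sum, sum of squares) in a single reduce and derive mean and stddev from those three numbers. A sketch, again with plain Python's `reduce` standing in for `RDD.aggregate`:

```python
import math
from functools import reduce

X = [1.0, 2.0, 3.0, 4.0, 5.0]  # hypothetical contents of the RDD

# Single pass: fold each element into a (count, sum, sum-of-squares)
# accumulator; in Spark these accumulators also combine across partitions.
count, total, sumsq = reduce(
    lambda acc, x: (acc[0] + 1, acc[1] + x, acc[2] + x * x),
    X, (0, 0.0, 0.0))

mean = total / count
stdev = math.sqrt(sumsq / count - mean * mean)  # population stddev
```

This avoids the second pass over X and the whole question of where m lives, at the cost of the numerically less stable sum-of-squares formula.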

neverlastn
  • What is X-m? The mean of X? Why do you calculate the mean m on a single worker node first? – tsiki Jul 11 '15 at 18:03
  • Thanks @tsiki. The mean is one of the metrics I calculate, and it's reduceByKey'd to one of the worker nodes. X-m is the vector X with m subtracted from every one of its items. This is part of a streaming application. – neverlastn Jul 11 '15 at 19:15

0 Answers