Assume a problem where I have an RDD X, I calculate the mean m on a single worker node, and then I want to compute X - m, e.g. to calculate standard deviations. I want this to happen in the cluster, not on the driver node, i.e. I want m to stay distributed. I thought of implementing it as a cartesian product of those two RDDs, so that as soon as m is calculated it propagates to all workers and they compute X - m. My fear is that Spark will instead shuffle X's partitions to wherever m lives and do the subtraction there. Is there a guarantee about which side gets shuffled in the case of X.cartesian(m)?
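Roughly what I have in mind, as a sketch (Scala, assuming an existing SparkContext `sc`; the data and variable names are just placeholders):

```scala
import org.apache.spark.rdd.RDD

val X: RDD[Double] = sc.parallelize((1 to 1000).map(_.toDouble))

// Compute (sum, count) under a single key so the reduction lands on one
// worker, then turn it into the mean -- m stays an RDD and is never
// collected to the driver.
val m: RDD[Double] = X
  .map(x => (0, (x, 1L)))
  .reduceByKey { case ((s1, n1), (s2, n2)) => (s1 + s2, n1 + n2) }
  .map { case (_, (sum, n)) => sum / n }

// Pair every element of X with the single-element m and subtract on the
// workers; this is the step where I'm unsure which side Spark shuffles.
val centered: RDD[Double] = X.cartesian(m).map { case (x, mu) => x - mu }

// e.g. a (population) standard deviation from the centered values
val stdev = math.sqrt(centered.map(d => d * d).mean())
```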
The mean/stddev problem above is just for illustration purposes - I know it's not a great example, but it's simple enough.