I have two JavaRDD<Double>, called rdd1 and rdd2, over which I'd like to evaluate some correlation, e.g. with Statistics.corr(). The two RDDs are generated through many transformations and actions, but at the end of the process they both have the same number of elements. I know that two conditions must be respected in order to evaluate the correlation, which are related (as far as I understand) to the zip method used internally by the correlation function. The conditions are (see the sketch after this list for the call I'm attempting):
- The RDDs must be split over the same number of partitions
- Every partition must have the same number of elements
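For reference, the call I'm attempting is essentially the following (a minimal sketch with placeholder values; in my real code the two RDDs come from the pipeline described above, and sc is an existing JavaSparkContext):

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.stat.Statistics;

// Placeholder inputs standing in for the real pipeline output
JavaRDD<Double> x = sc.parallelize(Arrays.asList(1.0, 2.0, 3.0, 4.0));
JavaRDD<Double> y = sc.parallelize(Arrays.asList(2.0, 4.0, 6.0, 8.0));

// Pearson correlation; internally this zips the two RDDs element by element,
// which is where the two conditions above come from
double correlation = Statistics.corr(x, y, "pearson");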
Moreover, according to the Spark documentation, I'm only using methods on the RDDs that preserve ordering, so the final correlation will be correct (even though an ordering problem wouldn't raise any exception). Now, the problem is that even if I'm able to keep the number of partitions consistent, for example with the code
JavaRDD<Double> rdd1Repartitioned = rdd1.repartition(rdd2.getNumPartitions());
what I don't know how to do (and what is giving me exceptions) is control the number of entries in every partition. I found a workaround that works for now: re-initializing the two RDDs I want to correlate
List<Double> rdd1Array = rdd1.collect();
List<Double> rdd2Array = rdd2.collect();
JavaRDD<Double> newRdd1 = sc.parallelize(rdd1Array);
JavaRDD<Double> newRdd2 = sc.parallelize(rdd2Array);
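After doing this, the partitions do seem to line up; a minimal way to inspect the per-partition counts (a sketch using glom(), which collects each partition into a list) is:

// Print the number of elements in each partition; for the zip inside
// Statistics.corr() to work, these counts must match pairwise across the two RDDs
for (List<Double> partition : newRdd1.glom().collect()) {
    System.out.println("partition size: " + partition.size());
}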
Still, I'm not sure this guarantees anything about consistency, and collecting everything to the driver might be really expensive computationally in some situations. Is there a way to control the number of elements in each partition or, more generally, to realign the partitions of two or more RDDs? (I know more or less how the partitioning system works, and I understand that this might be complicated from the distribution point of view.)
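One direction I've been considering, sketched below under the assumption that pairing elements by their global position is acceptable, is to key both RDDs with zipWithIndex and join on the index, so that the pairing no longer depends on matching partition layouts:

import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Key every element by its global position in the RDD
JavaPairRDD<Long, Double> indexed1 =
        rdd1.zipWithIndex().mapToPair(t -> new Tuple2<>(t._2(), t._1()));
JavaPairRDD<Long, Double> indexed2 =
        rdd2.zipWithIndex().mapToPair(t -> new Tuple2<>(t._2(), t._1()));

// Joining on the index co-partitions the two datasets by key,
// regardless of how the original RDDs were partitioned
JavaRDD<Tuple2<Double, Double>> pairs = indexed1.join(indexed2).values();

// Both projections come from the same parent RDD, so they share the same
// partitioning and per-partition counts, and corr() can zip them; the order
// of the pairs doesn't matter for the correlation, only the pairing itself
double correlation = Statistics.corr(
        pairs.map(t -> t._1()), pairs.map(t -> t._2()), "pearson");

Would something like this be considered sound, or does the shuffle introduced by the join make it just as expensive as the collect-based workaround?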