I have two JavaRDD<Double>, called rdd1 and rdd2, over which I'd like to evaluate some correlation, e.g. with Statistics.corr(). The two RDDs are generated through many transformations and actions, but at the end of the process they both have the same number of elements. I know that two conditions must be respected in order to evaluate the correlation, which are related (as far as I understand) to the zip method used internally by the correlation function. The conditions are (see the sketch after this list for the call I'm attempting):
- The RDDs must be split over the same number of partitions
- Every partition must have the same number of elements
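For reference, the call I'm attempting is essentially the following (a minimal sketch with placeholder values; in my real code the two RDDs come from the pipeline described above, and sc is an existing JavaSparkContext):

import java.util.Arrays;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.stat.Statistics;

// Placeholder inputs standing in for the real pipeline output
JavaRDD<Double> x = sc.parallelize(Arrays.asList(1.0, 2.0, 3.0, 4.0));
JavaRDD<Double> y = sc.parallelize(Arrays.asList(2.0, 4.0, 6.0, 8.0));

// Pearson correlation; internally this zips the two RDDs element by element,
// which is where the two conditions above come from
double correlation = Statistics.corr(x, y, "pearson");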
Moreover, according to the Spark documentation, I'm only using methods on the RDDs that preserve ordering, so the final correlation will be correct (even though an ordering problem wouldn't raise any exception). Now, the problem is that even if I'm able to keep the number of partitions consistent, for example with the code
JavaRDD<Double> rdd1Repartitioned = rdd1.repartition(rdd2.getNumPartitions());
what I don't know how to do (and what is giving me exceptions) is control the number of entries in every partition. I found a workaround that works for now: re-initializing the two RDDs I want to correlate
List<Double> rdd1Array = rdd1.collect();
List<Double> rdd2Array = rdd2.collect();
JavaRDD<Double> newRdd1 = sc.parallelize(rdd1Array);
JavaRDD<Double> newRdd2 = sc.parallelize(rdd2Array);
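After doing this, the partitions do seem to line up; a minimal way to inspect the per-partition counts (a sketch using glom(), which collects each partition into a list) is:

// Print the number of elements in each partition; for the zip inside
// Statistics.corr() to work, these counts must match pairwise across the two RDDs
for (List<Double> partition : newRdd1.glom().collect()) {
    System.out.println("partition size: " + partition.size());
}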
Still, I'm not sure this guarantees anything about consistency, and collecting everything to the driver might be really expensive computationally in some situations. Is there a way to control the number of elements in each partition or, more generally, to realign the partitions of two or more RDDs? (I know more or less how the partitioning system works, and I understand that this might be complicated from the distribution point of view.)
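One direction I've been considering, sketched below under the assumption that pairing elements by their global position is acceptable, is to key both RDDs with zipWithIndex and join on the index, so that the pairing no longer depends on matching partition layouts:

import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Key every element by its global position in the RDD
JavaPairRDD<Long, Double> indexed1 =
        rdd1.zipWithIndex().mapToPair(t -> new Tuple2<>(t._2(), t._1()));
JavaPairRDD<Long, Double> indexed2 =
        rdd2.zipWithIndex().mapToPair(t -> new Tuple2<>(t._2(), t._1()));

// Joining on the index co-partitions the two datasets by key,
// regardless of how the original RDDs were partitioned
JavaRDD<Tuple2<Double, Double>> pairs = indexed1.join(indexed2).values();

// Both projections come from the same parent RDD, so they share the same
// partitioning and per-partition counts, and corr() can zip them; the order
// of the pairs doesn't matter for the correlation, only the pairing itself
double correlation = Statistics.corr(
        pairs.map(t -> t._1()), pairs.map(t -> t._2()), "pearson");

Would something like this be considered sound, or does the shuffle introduced by the join make it just as expensive as the collect-based workaround?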