I have a JavaRDD which contains arrays of doubles. Now I want to calculate the Pearson correlation coefficient between each pair of arrays. But if I convert the RDD to vectors and apply Statistics.corr(), the function computes correlations between the columns, while I want them between the rows. Can anyone suggest a way to convert my data so that the rows become columns, so that I can apply the corr() function to it?

Edit: The Statistics.corr() function takes a JavaRDD<Vector> as input.

Goutham Panneeru

1 Answer

You can try converting each row to an RDD[Double] and comparing the pairwise combinations (manually or with loops):

val seriesX: RDD[Double] = ... // row1
val seriesY: RDD[Double] = ... // row2 must have the same number of partitions and cardinality as seriesX
val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")

Or you can try to transpose your RDD and pass the resulting RDD to corr(..) — some ideas on transposing are here: How to transpose an RDD in Spark
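The transpose boils down to pairing every value with its column index and regrouping by that index. A minimal sketch, shown on plain Scala collections so it runs without a cluster — the same zipWithIndex/flatMap/groupBy shape carries over to RDDs (using groupByKey on a pair RDD), as the linked question describes:

```scala
// rows: each inner array is one series; we want each column to become a row
val rows = Seq(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0))

// pair every value with its column index, then regroup by that index
val transposed: Seq[Array[Double]] =
  rows
    .flatMap(row => row.zipWithIndex.map { case (v, j) => (j, v) })
    .groupBy { case (j, _) => j }
    .toSeq
    .sortBy { case (j, _) => j }
    .map { case (_, pairs) => pairs.map { case (_, v) => v }.toArray }

// transposed is Seq(Array(1.0, 4.0), Array(2.0, 5.0), Array(3.0, 6.0))
```

Each transposed array can then be wrapped with Vectors.dense and the whole collection fed to Statistics.corr as a JavaRDD<Vector>.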

If you have many rows/records, though, and you want the correlation of each against every other, the resulting matrix might be too big and both options might be too slow (if feasible at all).
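When the data does fit on the driver, the loop-over-combinations option can be sketched entirely locally. The `pearson` helper below is my own illustration (not a Spark API); it computes the same coefficient Statistics.corr would return for two series:

```scala
// Pearson correlation between two equal-length rows (hypothetical helper)
def pearson(x: Array[Double], y: Array[Double]): Double = {
  require(x.length == y.length, "rows must have the same length")
  val n  = x.length
  val mx = x.sum / n
  val my = y.sum / n
  val cov = x.zip(y).map { case (a, b) => (a - mx) * (b - my) }.sum
  val sx  = math.sqrt(x.map(a => (a - mx) * (a - mx)).sum)
  val sy  = math.sqrt(y.map(b => (b - my) * (b - my)).sum)
  cov / (sx * sy)
}

// correlate every pair of rows
val rows = Seq(
  Array(1.0, 2.0, 3.0),
  Array(2.0, 4.0, 6.0),
  Array(3.0, 2.0, 1.0)
)
val corrs: Map[(Int, Int), Double] =
  rows.indices.combinations(2).map { case Seq(i, j) =>
    (i, j) -> pearson(rows(i), rows(j))
  }.toMap

// corrs((0, 1)) == 1.0 (perfectly correlated), corrs((0, 2)) == -1.0
```

For n rows this does n·(n−1)/2 comparisons, which is exactly why the matrix blows up for large n.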

Community