In pandas (Python), when I have a DataFrame df like this:

c1 c2 c3
0.1 0.3 0.5
0.2 0.4 0.6

I can use df.corr() to calculate the correlation matrix.

How do I do that in Spark with Scala?

I have read the official documentation, but the data structure in its example isn't like the one above, and I don't know how to convert my DataFrame into it.

Update one:

val df = Seq(
    (0.1, 0.3, 0.5, 0.6, 0.8, 0.1, 0.3, 0.5, 0.6, 0.8),
    (0.2, 0.4, 0.6, 0.7, 0.7, 0.2, 0.4, 0.6, 0.7, 0.7)
).toDF("c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10")

val assembler = new VectorAssembler()
  .setInputCols(Array("c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10"))
  .setOutputCol("vectors")

How do I show the whole result when there are 10 columns?

DachuanZhao
  • Does this answer your question? [How to get the correlation matrix of a pyspark data frame?](https://stackoverflow.com/questions/52214404/how-to-get-the-correlation-matrix-of-a-pyspark-data-frame) – JAdel Mar 08 '22 at 13:37
  • No. It uses `pyspark`, while I want a Scala Spark answer. – DachuanZhao Mar 08 '22 at 13:41
  • Check this out for a scala solution: https://spark.apache.org/docs/latest/ml-statistics.html – JAdel Mar 08 '22 at 13:43
  • Take a look at https://stackoverflow.com/a/70411405/6802156. Once you build the RowMatrix from the DF, it's immediate – Emiliano Martinez Mar 08 '22 at 14:36
  • I have read the document; its DataFrame's structure isn't the same as mine ... – DachuanZhao Mar 09 '22 at 01:36

1 Answer

You can solve your problem with the following code. It computes the Pearson correlation, which is also the default method of the pandas function.

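import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row

// toDF needs spark.implicits._ in scope (spark-shell imports it automatically)
val df = Seq(
    (0.1, 0.3, 0.5),
    (0.2, 0.4, 0.6)
).toDF("c1", "c2", "c3")

// Pack the numeric columns into the single vector column that Correlation.corr expects
val assembler = new VectorAssembler()
  .setInputCols(Array("c1", "c2", "c3"))
  .setOutputCol("vectors")

val transformed = assembler.transform(df)

// The result is a one-row DataFrame whose only cell holds the correlation Matrix
val Row(corr: Matrix) = Correlation.corr(transformed, "vectors").head

println(s"Pearson correlation matrix:\n $corr")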
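For the 10-column case in the update, the default string form of the matrix may be truncated. The following is a minimal sketch of one way to print every entry, not a definitive implementation. It assumes a local SparkSession (in spark-shell, `spark` already exists), and the names `cols` and `assembled` and the app name `corr-demo` are illustrative only. It relies on `Matrix.rowIter` and on `Matrix.toString(maxLines, maxLineWidth)` from `org.apache.spark.ml.linalg`.

```scala
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.linalg.Matrix
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.{Row, SparkSession}

// Assumption: a local session for illustration; skip this in spark-shell.
val spark = SparkSession.builder().master("local[*]").appName("corr-demo").getOrCreate()
import spark.implicits._

// Hypothetical helper value: the column names, reused for toDF, the assembler and the labels.
val cols = Array("c1", "c2", "c3", "c4", "c5", "c6", "c7", "c8", "c9", "c10")

val df = Seq(
  (0.1, 0.3, 0.5, 0.6, 0.8, 0.1, 0.3, 0.5, 0.6, 0.8),
  (0.2, 0.4, 0.6, 0.7, 0.7, 0.2, 0.4, 0.6, 0.7, 0.7)
).toDF(cols: _*)

val assembled = new VectorAssembler()
  .setInputCols(cols)
  .setOutputCol("vectors")
  .transform(df)

// Pull the Matrix out of the single-row result.
val Row(corr: Matrix) = Correlation.corr(assembled, "vectors").head

// Option 1: widen the limits of the built-in string form so nothing is elided.
println(corr.toString(cols.length, 1000))

// Option 2: print a labelled table, one matrix row per line, similar to pandas output.
println(cols.mkString("\t", "\t", ""))
corr.rowIter.zip(cols.iterator).foreach { case (row, name) =>
  println(row.toArray.map(v => f"$v%.2f").mkString(s"$name\t", "\t", ""))
}
```

The second option trades precision for readability by rounding to two decimals; drop the `f` formatting to see the raw values.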
elyptikus