How to calculate the covariance matrix of a pyspark dataframe?

Question

I have a large PySpark DataFrame whose columns are products and whose rows are their prices over time. I need to calculate the covariance matrix of all the products, but the data is too big to convert to a pandas DataFrame, so I need to do it in PySpark. I've searched everywhere but couldn't find a solution to this problem. Does anyone have an idea of how it could be done?

I already have the correlation matrix, so any method using the diagonal matrix of standard deviations is also very welcome.

Here is an example of two columns of my dataframe.

Actually, the copiable/pastable version of your data was better than the screenshot you refer to. — keepAlive, Jun 15 '21 at 14:11
Have you tried using the built-in function covar_pop (https://spark.apache.org/docs/latest/api/sql/#covar_pop)? — Vitaliy, Jun 15 '21 at 15:41
Yeah, it calculates the covariance between two columns of the data frame, but I'm not sure how to create the covariance matrix from it — Macro433, Jun 16 '21 at 20:17

score 2 · Answer 1 · answered Oct 12 '21 at 14:53

There are a number of linear algebra functions in SparkML. You are probably looking for one of the RowMatrix methods, specifically computeCovariance() (Spark documentation).

Assuming you are looking for the equivalent of:

import pandas as pd

dummy = pd.DataFrame([[1,2],[2,1]])
dummy.cov()

Then you can, starting from a dataframe, compute the covariance matrix using pyspark with something like the following:

from pyspark.mllib.linalg.distributed import RowMatrix
from pyspark.ml.feature import VectorAssembler

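# `spark` is assumed to be an existing SparkSession
df = spark.createDataFrame([[1, 2], [2, 1]])
vector_col = "cov_features"
# Pack all columns into one vector column; handleInvalid="skip" drops rows with invalid values
assembler = VectorAssembler(inputCols=df.columns, outputCol=vector_col, handleInvalid="skip")
df_vector = assembler.transform(df).select(vector_col)
# Build the RowMatrix from the underlying RDD, converting each Row to a list
mat_df = RowMatrix(df_vector.rdd.map(list))
# computeCovariance() returns a local DenseMatrix, not a DataFrame; use .toArray() for a NumPy array
result_df = mat_df.computeCovariance()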

The vectorization of the dataframe is required because pyspark.mllib.linalg works with vector representations.
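Since the question mentions that the correlation matrix is already available, the covariance matrix can also be recovered from it with the diagonal matrix of standard deviations: cov = D · corr · D, so cov[i][j] = std[i] * corr[i][j] * std[j]. The correlation matrix is only products × products in size, so this step can run locally. Below is a minimal sketch in plain Python; the function name covariance_from_correlation and the sample values are hypothetical, and it assumes the standard deviations were computed separately (for example with pyspark.sql.functions.stddev_samp, which would give the sample covariance).

```python
def covariance_from_correlation(corr, stds):
    """Return D @ corr @ D, where D = diag(stds).

    corr: n x n correlation matrix as a list of lists
    stds: list of n per-column standard deviations
    """
    n = len(stds)
    if len(corr) != n or any(len(row) != n for row in corr):
        raise ValueError("corr must be an n x n matrix matching len(stds)")
    # Element-wise form of D @ corr @ D: scale row i by stds[i], column j by stds[j]
    return [[stds[i] * corr[i][j] * stds[j] for j in range(n)] for i in range(n)]


# Hypothetical values for two perfectly anti-correlated products
corr = [[1.0, -1.0], [-1.0, 1.0]]
stds = [2.0, 3.0]
cov = covariance_from_correlation(corr, stds)
print(cov)
```

The diagonal of the result holds the variances (the squared standard deviations), which gives a quick sanity check against the matrix returned by computeCovariance().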
