I have a large PySpark DataFrame whose columns are products and whose rows are their prices over time. I need to calculate the covariance matrix of all the products, but the data is too big to convert to a pandas DataFrame, so I need to do it in PySpark. I've searched everywhere but couldn't find a solution. Does anyone have an idea of how this could be done?

I already have the correlation matrix, so any method using the diagonal matrix of standard deviations is also very welcome.

Here is an example of two columns of my dataframe.

keepAlive
Macro433
  • Actually, the copiable/pastable version of your data was better than the screenshot you refer to. – keepAlive Jun 15 '21 at 14:11
  • Have you tried using the built-in function: https://spark.apache.org/docs/latest/api/sql/#covar_pop? – Vitaliy Jun 15 '21 at 15:41
  • Yeah, it calculates the covariance between two columns of the data frame, but I'm not sure how to create the covariance matrix from it – Macro433 Jun 16 '21 at 20:17

1 Answer

There are a number of linear algebra functions in Spark's MLlib. You are probably looking for one of the RowMatrix methods, specifically computeCovariance() (Spark documentation).

Assuming you are looking for the equivalent of:

import pandas as pd

dummy = pd.DataFrame([[1, 2], [2, 1]])
dummy.cov()

Then you can, starting from a dataframe, compute the covariance matrix using pyspark with something like the following:

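from pyspark.ml.feature import VectorAssembler
from pyspark.mllib.linalg.distributed import RowMatrix

# Assumes an active SparkSession named `spark`.
df = spark.createDataFrame([[1, 2], [2, 1]])

# Pack all columns into a single vector column; rows with nulls/NaNs are skipped.
vector_col = "cov_features"
assembler = VectorAssembler(inputCols=df.columns, outputCol=vector_col, handleInvalid="skip")
df_vector = assembler.transform(df).select(vector_col)

# RowMatrix belongs to the RDD-based mllib API, so hand it an RDD of rows.
mat_df = RowMatrix(df_vector.rdd.map(list))

# computeCovariance() returns a local matrix; call .toArray() for a NumPy array.
result_df = mat_df.computeCovariance()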

The dataframe has to be vectorized because pyspark.mllib.linalg works with vector representations.
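
Since the question mentions already having the correlation matrix, the covariance matrix can also be recovered as D · R · D, where R is the correlation matrix and D is the diagonal matrix of per-column standard deviations. Below is a minimal sketch in plain Python. It assumes the correlation matrix and the standard deviations are small enough to collect to the driver (one value per product, or per pair of products), and the helper name `cov_from_corr` is hypothetical. Use sample standard deviations if you want the result to match pandas' `cov()`.

```python
import math


def cov_from_corr(corr, stds):
    """Return the covariance matrix D @ R @ D as nested lists.

    corr: square correlation matrix (list of lists).
    stds: per-column standard deviations, in the same column order.
    """
    n = len(stds)
    if len(corr) != n or any(len(row) != n for row in corr):
        raise ValueError("corr must be an n x n matrix matching len(stds)")
    # Multiplying by a diagonal matrix on both sides scales entry (i, j)
    # by stds[i] * stds[j].
    return [[corr[i][j] * stds[i] * stds[j] for j in range(n)] for i in range(n)]


# Toy data matching the two-column example above: columns [1, 2] and [2, 1].
corr = [[1.0, -1.0], [-1.0, 1.0]]
stds = [math.sqrt(0.5), math.sqrt(0.5)]  # sample standard deviations
cov = cov_from_corr(corr, stds)
```

For this toy input the result is approximately [[0.5, -0.5], [-0.5, 0.5]], the same values pandas gives for the dummy frame above.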

RndmSymbl