How to calculate covariance matrix of data frame

Question

I have read a data frame of sensor data using the pandas read_fwf function. I need to find the covariance matrix of the resulting 928991 x 8 matrix. Eventually, I want to find the eigenvectors and eigenvalues of this covariance matrix for principal component analysis.

There is no function in pandas to calculate the covariance matrix. However, there is a function for a correlation matrix. Perhaps you could use that one? – DYZ, Apr 28 '19 at 22:42
There's a difference between a covariance matrix and a correlation matrix, though PCA can be done on both. The covariance matrix is used when the variable scales are similar, and the correlation matrix is used when the variables are on different scales. I would prefer to use the covariance matrix in this scenario, as the data from the 8 sensors are on the same scale. – YatShan, Apr 28 '19 at 23:02
If you multiply the correlation matrix rowwise and columnwise by the variances, won't it become a covariance matrix? – DYZ, Apr 29 '19 at 00:22
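
To illustrate the relationship raised in the comment above, here is a minimal sketch on made-up data (the frame `df` and its column names are assumptions, not the real sensor file). Strictly, the scaling factors are the standard deviations (the square roots of the variances): cov[i, j] = corr[i, j] * std[i] * std[j].

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the sensor columns: three columns on different scales.
rng = np.random.default_rng(0)
df = pd.DataFrame(
    rng.normal(size=(1000, 3)) * [1.0, 5.0, 20.0],
    columns=["R1", "R2", "R3"],
)

corr = df.corr()   # Pearson correlation matrix
std = df.std()     # per-column sample standard deviation (ddof=1)

# Scale row i by std[i] and column j by std[j].
cov_from_corr = corr * np.outer(std, std)

# Compare against the covariance matrix pandas computes directly.
print(np.allclose(cov_from_corr, df.cov()))
```

The same identity read in reverse is why PCA on the correlation matrix amounts to PCA on the covariance matrix of standardized columns.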

jintao ren · Answer 1 · 2019-04-28T23:08:40.110

First, convert the pandas DataFrame to a NumPy array using df.values. For example:

A = df.values

Computing either the covariance matrix or the PCA is much easier once the data is in a NumPy array. A fuller example:

# import the functions needed to compute the covariance matrix
import pandas as pd
from numpy import mean
from numpy import cov
from numpy.linalg import eig

# assume the data is loaded with pd.read_fwf into the variable *df*
# (filepath, col_widths and col_names are placeholders for your own values)
df = pd.read_fwf(filepath, widths=col_widths, names=col_names)
# put the dataframe values into a numpy array
A = df.values
# check matrix A's shape; it should be (928991, 8)
print(A.shape)
# calculate the mean of each column
M = mean(A, axis=0)
print(M)
# center columns by subtracting column means
C = A - M
print(C)
# calculate covariance matrix of centered matrix (cov expects variables in rows)
V = cov(C.T)
print(V)
# eigendecomposition of covariance matrix
values, vectors = eig(V)
print(vectors)
print(values)
# project the centered data onto the eigenvectors
P = vectors.T.dot(C.T)
print(P.T)

Running the example prints the shape of the matrix, the column means, the centered matrix, and the covariance matrix, then the eigenvectors and eigenvalues of the covariance matrix, and finally the projection of the centered data. Here is a link you may find useful for your PCA task.
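
As a side note, here is a small sketch on random data (the array `A` is a made-up stand-in for the real 928991 x 8 array) suggesting that the manual centering step is optional: np.cov subtracts the column means itself, and rowvar=False tells it that the variables are in columns, so no transposes are needed.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(500, 8))   # hypothetical stand-in for the sensor array

# Manual route: center each column, then form C^T C / (n - 1).
C = A - A.mean(axis=0)
V_manual = C.T @ C / (A.shape[0] - 1)

# np.cov centers internally; rowvar=False treats columns as variables.
V_numpy = np.cov(A, rowvar=False)

print(V_numpy.shape)
print(np.allclose(V_manual, V_numpy))
```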

Hi, thank you for your answer. But printing A = df.values consumes a huge amount of memory and takes considerable time. It might be because the original data frame has 928991 rows. – YatShan, Apr 28 '19 at 23:10
If you can't read the file directly into an array, you could also try a chunking method: use an iterator to concatenate the chunks from read_fwf into an array, like 'read_fwf(.... , chunksize = 1000000)' – jintao ren, Apr 28 '19 at 23:26
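
One way the chunking idea could be taken further is sketched below: rather than concatenating the chunks, accumulate running sums so the full array never has to sit in memory. This is an assumption-laden illustration, not the commenter's method. The in-memory text, the column widths, and the chunk size are all made up, and the one-pass formula can lose precision when the column means are large compared with the spread.

```python
import io
import numpy as np
import pandas as pd

# Build a small fixed-width "file" in memory (hypothetical data, 3 columns).
rng = np.random.default_rng(2)
data = rng.normal(size=(1000, 3))
text = "\n".join("".join(f"{x:12.6f}" for x in row) for row in data)

n = 0
total = np.zeros(3)         # running sum of each column
cross = np.zeros((3, 3))    # running sum of X^T X

# With chunksize set, read_fwf returns an iterator of DataFrames.
reader = pd.read_fwf(io.StringIO(text), widths=[12, 12, 12],
                     header=None, chunksize=250)
for chunk in reader:
    X = chunk.to_numpy(dtype="float64")
    n += X.shape[0]
    total += X.sum(axis=0)
    cross += X.T @ X

mean = total / n
# Sample covariance: (sum of x x^T - n * mean mean^T) / (n - 1)
cov_stream = (cross - n * np.outer(mean, mean)) / (n - 1)
print(cov_stream)
```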

gmds · Answer 2 · Apr 28 '19 at 23:03 · score 1

Why not just use the pd.DataFrame.cov function?
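
A minimal sketch of what that could look like, assuming the column layout mentioned in the comments (id, time, R1 to R8, Temp., Humidity). The frame here is filled with made-up values rather than the real dataset, and `sensor_cols` is a hypothetical helper name.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with the same column names as described in the thread.
rng = np.random.default_rng(3)
sensor_cols = [f"R{i}" for i in range(1, 9)]
df = pd.DataFrame(rng.normal(size=(200, 8)), columns=sensor_cols)
df.insert(0, "time", np.linspace(0.0, 1.0, 200))
df.insert(0, "id", 0)
df["Temp."] = 25.0
df["Humidity"] = 50.0

# Select only the sensor columns, then let pandas compute the covariance.
cov_matrix = df[sensor_cols].cov()
print(cov_matrix.shape)
```

Selecting the columns by name avoids relying on their positions and leaves the original frame untouched.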

Hi gmds, the data frame is 928991 x 12, where the 12 columns are id, time, R1, R2 ... R8, temperature, humidity. The covariance matrix needs to be calculated for the R1, R2, R3 ... R8 columns, that is, 928991 x 8. Using pd.DataFrame.cov returned a 4 x 4 matrix, with id, time, R1, humidity. – YatShan Apr 28 '19 at 23:25
@Yatshan Are your columns all of numeric type? If they are being omitted, that suggests that they are of `object` type. – gmds Apr 28 '19 at 23:26
What is meant by object type? The data I'm talking about is obtained from the following link: https://archive.ics.uci.edu/ml/datasets/Gas+sensors+for+home+activity+monitoring – YatShan Apr 29 '19 at 03:27
@Yatshan Check `df.dtypes`. – gmds Apr 29 '19 at 03:29
It returned the following: `id int64 time float64 R1 float64 R2 object R3 object R4 object R5 object R6 object R7 object R8 object Temp. object Humidity float64 dtype: object` – YatShan Apr 29 '19 at 03:33
Yes, so as I said, your columns are of `object` type. It only makes sense to calculate covariance for numbers. You need to perform a conversion first. – gmds Apr 29 '19 at 03:35
I converted the object-type columns to float64 using notna().astype('float64'). Then I deleted the first two columns and the last two columns to obtain a 928991 x 8 matrix. I applied pd.DataFrame.cov to the modified matrix, and it returned a 1 x 1 covariance matrix. – YatShan Apr 30 '19 at 04:29
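
A small sketch of why notna().astype('float64') is probably not the conversion wanted here: notna() returns a True/False mask of which cells are present, so casting it gives 1.0 everywhere instead of the sensor readings. pd.to_numeric is one way to parse the values. The tiny frame and its contents below are made up for illustration.

```python
import pandas as pd

# Hypothetical columns holding numbers as strings, with one bad entry.
df = pd.DataFrame({
    "R2": ["12.5", "13.0", "bad", "12.8"],
    "R3": ["7.1", "7.4", "7.2", "7.9"],
})

# notna() answers "is a value present?", so every entry becomes 1.0 here.
mask = df.notna().astype("float64")
print(mask)

# pd.to_numeric parses the strings; errors="coerce" turns unparseable entries into NaN.
numeric = df.apply(pd.to_numeric, errors="coerce")
print(numeric.dtypes)
print(numeric.cov())
```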

YatShan · Accepted Answer · 2019-04-30T05:26:03.267

The answer to this question is as follows:

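import pandas as pd
from numpy.linalg import eig

# read the whitespace-delimited file (sep=r'\s+' replaces the deprecated delim_whitespace=True)
df_sensor_data = pd.read_csv('HT_Sensor_dataset.dat', sep=r'\s+')
# keep only the eight sensor columns R1..R8
del df_sensor_data['id']
del df_sensor_data['time']
del df_sensor_data['Temp.']
del df_sensor_data['Humidity']
# make sure every column is numeric; notna() would only give a True/False mask,
# whereas to_numeric converts the values (a no-op if they are already numeric)
df_sensor_data = df_sensor_data.apply(pd.to_numeric, errors='coerce')
covariance_matrix = df_sensor_data.cov()
print(covariance_matrix)

# eigendecomposition of the 8 x 8 covariance matrix
values, vectors = eig(covariance_matrix)
print(values)
print(vectors)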
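
As a possible follow-up for the PCA step, here is a sketch on made-up data (not the sensor file) that uses numpy.linalg.eigh, which is intended for symmetric matrices such as a covariance matrix. It then sorts the components by variance and projects the centered data. The names `explained` and `scores` and the choice of two components are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
# Hypothetical data: 8 columns with decreasing spread.
X = rng.normal(size=(1000, 8)) * np.arange(8, 0, -1)

cov = np.cov(X, rowvar=False)

# eigh returns real eigenvalues in ascending order for a symmetric matrix.
values, vectors = np.linalg.eigh(cov)

# Reorder so the largest-variance component comes first.
order = np.argsort(values)[::-1]
values = values[order]
vectors = vectors[:, order]

# Fraction of total variance carried by each component.
explained = values / values.sum()

# Project the centered data onto the first two principal components.
scores = (X - X.mean(axis=0)) @ vectors[:, :2]
print(explained)
print(scores.shape)
```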
