0

I have read a data frame of sensor data using the pandas read_fwf function. I need to find the covariance matrix of the resulting 928991 x 8 matrix. Eventually, I want to find the eigenvectors and eigenvalues of this covariance matrix, as in principal component analysis.

ggael
YatShan
  • There is no function in pandas to calculate the covariance matrix. However, there is a function for a correlation matrix. Perhaps you could use that one? – DYZ Apr 28 '19 at 22:42
  • There's a difference between a covariance matrix and a correlation matrix, though PCA can be done on either. The covariance matrix is used when the variable scales are similar, and the correlation matrix when the variables are on different scales. I would prefer to use the covariance matrix in this scenario, as the data from the 8 sensors are on the same scale. – YatShan Apr 28 '19 at 23:02
  • If you multiply the correlation matrix row-wise and column-wise by the variances, won't it become a covariance matrix? – DYZ Apr 29 '19 at 00:22
  • @DYZ Yes, but why not just use `pd.DataFrame.cov`? – gmds Apr 29 '19 at 03:14
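
To make the relationship discussed in these comments concrete: a covariance matrix can be recovered from a correlation matrix by scaling each entry by the standard deviations of the two columns involved (rather than the variances), and pandas also provides `DataFrame.cov` directly. A minimal sketch on synthetic data; the column names and random values are placeholders, not the sensor data from the question:

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the eight sensor columns (values are made up).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1000, 8)),
                  columns=[f"R{i}" for i in range(1, 9)])

corr = df.corr()   # Pearson correlation matrix, 8 x 8
std = df.std()     # sample standard deviation of each column (ddof=1)

# cov[i, j] = corr[i, j] * std[i] * std[j]
cov_from_corr = corr * np.outer(std, std)

# pandas computes the same matrix directly.
cov_direct = df.cov()

print(np.allclose(cov_from_corr, cov_direct))
```

Both routes use the same sample normalization (`ddof=1`), which is why the two matrices agree.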

3 Answers

2

First, you need to convert the pandas DataFrame to a NumPy array using df.values. For example:

A = df.values

It is much easier to compute either the covariance matrix or PCA once your data is in a NumPy array. In more detail:

# import the functions needed to compute the covariance matrix
import pandas as pd
from numpy import mean
from numpy import cov
from numpy.linalg import eig

# assume you load your data using pd.read_fwf into the variable *df*
# (filepath, col_widths and col_names are placeholders for your own values)
df = pd.read_fwf(filepath, widths=col_widths, names=col_names)
# put the dataframe values into a numpy array
A = df.values
# check matrix A's shape, it should be (928991, 8)
print(A.shape)
# calculate the mean of each column
M = mean(A.T, axis=1)
print(M)
# center columns by subtracting column means
C = A - M
print(C)
# calculate covariance matrix of the centered matrix (rows of C.T are the variables)
V = cov(C.T)
print(V)
# eigendecomposition of the covariance matrix
values, vectors = eig(V)
print(vectors)
print(values)
# project the centered data onto the eigenvectors
P = vectors.T.dot(C.T)
print(P.T)

Running the example prints the shape of the matrix, the column means, the centered matrix and its covariance matrix, then the eigenvectors and eigenvalues of that covariance matrix, and finally the projection of the centered data. Here is a link you may find useful for your PCA task.
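
One detail worth noting about the steps above: `numpy.cov` subtracts the mean of each variable itself, so centering beforehand does not change the covariance matrix; the centered matrix is still needed for the projection. A small sketch on made-up data (the array is a placeholder, not the sensor data):

```python
import numpy as np

rng = np.random.default_rng(1)
# rows = observations, columns = variables (invented values)
A = rng.normal(loc=5.0, scale=2.0, size=(500, 3))

C = A - A.mean(axis=0)             # centered copy, used for the projection

V_raw = np.cov(A, rowvar=False)    # columns treated as variables, no manual centering
V_centered = np.cov(C.T)           # same call as in the answer
print(np.allclose(V_raw, V_centered))

values, vectors = np.linalg.eig(V_raw)
P_answer = vectors.T.dot(C.T).T    # projection written as in the answer
P_direct = C @ vectors             # equivalent, without the transposes
print(np.allclose(P_answer, P_direct))
```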

jintao ren
  • Hi, thank you for your answer. But printing A = df.values consumes a huge amount of memory and takes a considerable time. It might be because the original data frame has 928991 rows. – YatShan Apr 28 '19 at 23:10
  • 1
    If you can't read the file directly into an array, you could also try a chunking method: use an iterator to read chunks with read_fwf and concatenate them into an array, like: 'read_fwf(.... , chunksize = 1000000)' – jintao ren Apr 28 '19 at 23:26
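
The chunking idea can be taken one step further: the covariance matrix only needs the row count, the column sums, and the sum of cross-products, so each chunk can be reduced and discarded without ever holding the full array. A minimal sketch, assuming every chunk contains only numeric columns; `chunked_cov` is a hypothetical helper, and the chunks here are slices of synthetic data rather than the output of a file reader:

```python
import numpy as np
import pandas as pd

def chunked_cov(chunks):
    """Sample covariance (ddof=1) accumulated over an iterable of numeric DataFrames."""
    n = 0
    total = None   # running column sums
    cross = None   # running sum of X.T @ X
    for chunk in chunks:
        X = chunk.to_numpy(dtype=float)
        if total is None:
            total = np.zeros(X.shape[1])
            cross = np.zeros((X.shape[1], X.shape[1]))
        n += X.shape[0]
        total += X.sum(axis=0)
        cross += X.T @ X
    mean = total / n
    # sum of (x - mean)(x - mean)^T equals cross - n * mean mean^T
    return (cross - n * np.outer(mean, mean)) / (n - 1)

# Synthetic data split into chunks of 1000 rows (placeholder for real chunks).
rng = np.random.default_rng(2)
df = pd.DataFrame(rng.normal(size=(10_000, 8)))
chunks = (df.iloc[i:i + 1000] for i in range(0, len(df), 1000))

cov = chunked_cov(chunks)
print(np.allclose(cov, df.cov().to_numpy()))
```

With a real file, the chunks would come from the iterator that the reader returns when `chunksize` is set. This single-pass formula can lose precision when the column means are large compared with the standard deviations; subtracting a rough estimate of the mean from every chunk first is one way to reduce that.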
1

Why not just use the pd.DataFrame.cov function?

gmds
  • Hi gmds, the data frame is 928991 x 12, where the 12 columns are id, time, R1, R2 ... R8, temperature, humidity. The covariance matrix needs to be calculated for the R1, R2, R3 ... R8 columns, that is, 928991 x 8. Using pd.DataFrame.cov returned a 4 x 4 matrix, with id, time, R1, humidity. – YatShan Apr 28 '19 at 23:25
  • @Yatshan Are your columns all of numeric type? If they are being omitted, that suggests that they are of `object` type. – gmds Apr 28 '19 at 23:26
  • What is meant by object type? The data I'm talking about was obtained from the following link: https://archive.ics.uci.edu/ml/datasets/Gas+sensors+for+home+activity+monitoring – YatShan Apr 29 '19 at 03:27
  • 1
    @Yatshan Check `df.dtypes`. – gmds Apr 29 '19 at 03:29
  • It returned the following: `id int64 time float64 R1 float64 R2 object R3 object R4 object R5 object R6 object R7 object R8 object Temp. object Humidity float64 dtype: object` – YatShan Apr 29 '19 at 03:33
  • Yes, so as I said, your columns are of `object` type. It only makes sense to calculate covariance for numbers. You need to perform a conversion first. – gmds Apr 29 '19 at 03:35
  • I converted the object-type columns to float64, using notna().astype('float64'). Then I deleted the first two columns and the last two columns to obtain a 928991 x 8 matrix. When I applied pd.DataFrame.cov to the modified matrix, it returned a 1 x 1 covariance matrix. – YatShan Apr 30 '19 at 04:29
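
A note on the conversion mentioned in the last comment: `notna()` returns a boolean mask (True where a value is present), so `notna().astype('float64')` yields ones and zeros rather than the sensor readings. A sketch of converting `object` columns with `pd.to_numeric` instead; the tiny frame and its values are invented for illustration:

```python
import pandas as pd

# Invented example: numbers stored as strings, held in object-dtype columns.
df = pd.DataFrame({
    "R1": ["12.5", "12.7", "13.1", "12.9"],
    "R2": ["10.1", "10.4", "bad", "10.2"],   # one unparseable entry
    "R3": ["7.0", "7.2", "7.1", "7.4"],
}, dtype=object)

# The mask marks where values exist; it does not contain the values.
mask = df.notna().astype("float64")
print(mask["R1"].tolist())

# Convert each column; entries that cannot be parsed become NaN.
numeric = df.apply(pd.to_numeric, errors="coerce")
print(numeric.dtypes)

# Covariance over the numeric columns (NaN entries are excluded pairwise).
cov = numeric.cov()
print(cov.shape)
```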
1

The answer to this question is as follows:

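import pandas as pd
from numpy.linalg import eig

# read the whitespace-delimited file (sep=r'\s+' is equivalent to
# delim_whitespace=True, which newer pandas versions deprecate)
df_sensor_data = pd.read_csv('HT_Sensor_dataset.dat', sep=r'\s+')
# keep only the eight sensor columns R1..R8
df_sensor_data = df_sensor_data.drop(columns=['id', 'time', 'Temp.', 'Humidity'])
# make sure every column is numeric; unparseable entries become NaN
# (notna() would only give a True/False mask, not the values)
df_sensor_data = df_sensor_data.apply(pd.to_numeric, errors='coerce')
covariance_matrix = df_sensor_data.cov()
print(covariance_matrix)

# eigendecomposition of the 8 x 8 covariance matrix
values, vectors = eig(covariance_matrix)
print(values)
print(vectors)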
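
For the PCA step itself, the eigenpairs are usually sorted by decreasing eigenvalue so that the first component explains the most variance. A sketch on synthetic data, using `numpy.linalg.eigh` (intended for symmetric matrices such as a covariance matrix) in place of `eig`; the random data is a placeholder for the eight sensor columns:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
# Placeholder data with unequal column variances (not the real sensor readings).
df = pd.DataFrame(rng.normal(size=(2000, 8)) * np.arange(1, 9),
                  columns=[f"R{i}" for i in range(1, 9)])

covariance_matrix = df.cov().to_numpy()

# eigh returns real eigenvalues in ascending order; reverse to get descending.
values, vectors = np.linalg.eigh(covariance_matrix)
order = np.argsort(values)[::-1]
values = values[order]
vectors = vectors[:, order]        # columns are the eigenvectors

# Share of the total variance carried by each component.
explained = values / values.sum()
print(explained.round(3))

# Project the centered data onto the first two components.
scores = (df - df.mean()).to_numpy() @ vectors[:, :2]
print(scores.shape)
```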
YatShan