0

I am learning Mahalonobis Distance by following: https://www.machinelearningplus.com/statistics/mahalanobis-distance/

I kind of confused by the concept of the covariance matrix of arrays, assume we have a data frame like this:

        comedy   disaster  action
movie1    0.2     0.3      0.6
movie2    0.4     0.6      0.2
movie3    0.1     0.4      0.8
...

Each row represents an observation and each column represents a variable Now I want to calculate the Mahalonobis Distance between them so I can get a similarity, but first I need to calculate the covraince matrix, I used np.cov(), but it seems this function assumes each column represents an observation, I'm very confused, can someone shows me a detailed process how to calculate the covraince matrix of this? Many thanks.

Cecilia
  • 309
  • 2
  • 12

1 Answers1

0

As I understand your question properly you want to calculate covariance matrix for every column in dataset provided by you. To understand better how np.cov function works you can look at the source code and documentation: As mention in the linked article,


Mahalanobis distance is an effective multivariate distance metric that measures 
the distance between a point and a distribution.

so you should extract each variable from your dataset and calculate distance for every variable (column in this example) in your dataset.

source code

docs

So example calculation for comedy variable should looks like follows:


import numpy as np 

tmp_var = df.comedy.values #Now its type will be numpy.ndarray as required in docs

comedy_cov_mat =   np.cov(tmp_var)

# comedy_cov_mat should then have nxn shape when n is number of rows in your dataset.

s3nh
  • 546
  • 2
  • 11
  • I want to use Mahalanobis distance for measuring how close it is between two movies, does that I need to calculate the covariance by row? Could you please provide a detailed calculation process?Many thanks – Cecilia Jul 26 '19 at 14:17
  • In simple term, mahalanobis distance is a measure of distance between a variable data point and its distribution. It does not give an Infirmation between random variables. If you want to look at distance between distribution you should look at entropy measures, for example Hellinger distance or total variation distance – s3nh Jul 26 '19 at 14:23
  • Does this mean I can't use mahalanobis distance to measure the similarity between movies? I tried cosine similarity, Euclidean distance – Cecilia Jul 26 '19 at 14:34