1

I'm trying to reproduce this example using Excel to calculate the Mahalanobis distance between two groups.

Plot of data from example

To my mind the example provides a good explanation of the concept. However, I'm not able to reproduce in R.

The result obtained in the example using Excel is Mahalanobis(g1, g2) = 1.4104.

Following the answer given here for R and apply it to the data above as follows:

# dataset used in the Excel example
g1 <- matrix(c(2, 2, 2, 5, 6, 5, 7, 3, 4, 7, 6, 4, 5, 3, 4, 6, 2, 5, 1, 3), ncol = 2, byrow = TRUE)
g2 <- matrix(c(6, 5, 7, 4, 8, 7, 5, 6, 5, 4), ncol = 2, byrow = TRUE)

# function adopted from R example
D.sq <- function (g1, g2) {
    dbar <- as.vector(colMeans(g1) - colMeans(g2))
    S1 <- cov(g1)
    S2 <- cov(g2)
    n1 <- nrow(g1)
    n2 <- nrow(g2)
    V <- as.matrix((1/(n1 + n2 - 2)) * (((n1 - 1) * S1) + ((n2 - 1) * S2)))
    D.sq <- t(dbar) %*% solve(V) %*% dbar
    res <- list()
    res$D.sq <- D.sq
    res$V <- V
    res
}

D.sq(g1,g2)

and executing the function on the data returns the following output:

$D.sq
         [,1]
[1,] 1.724041

$V
          [,1]      [,2]
[1,] 3.5153846 0.3153846
[2,] 0.3153846 2.2230769

Afaik $D.sq represents the distance and 1.724 is significantly different to the 1.4101 result from the Excel example. As I'm new to the concept of the Mahalanobis distance I was wondering if I did something wrong and/or there's a better way to calculate this e.g. using mahalanobis()?

raumkundschafter
  • 429
  • 1
  • 8
  • 24

1 Answers1

2

The reasons why do you get different result are

  • The Excel algorithm is actually different to the R algorithm in how you calculate the pooled covariance matrix, the R version gives you the result of unbiased estimate of covariance matrix, while the Excel version gives you the MLE estimate. In the R version, you calculate the matrix like: ((n1 - 1) * cov(g1) + (n2 - 1) * cov(g2)) / (n1 + n2 - 2); while in Excel version: ((n1 - 1) * cov(g1) + (n2 - 1) * cov(g2)) / (n1 + n2).

  • The last calculation step in the Excel post you refer to is incorrect, the result should be 1.989278 instead.

Edit:

The unbiased estimator for pooled covariance matrix is the standard way, as is in the Wikipedia page: https://en.wikipedia.org/wiki/Pooled_variance . A related fact is that in R, when you use cov or var, you get an unbiased estimator instead of MLE estimator for covariance matrix.

Edit2: The mahalanobis function in R calculates the mahalanobis distance from points to a distribution. It does not calculate the mahalanobis distance of two samples.

Conclusion: In sum, the most standard way to calculate mahalanobis distance between two samples is the R code in the original post, which uses the unbiased estimator of pooled covariance matrix.

Consistency
  • 2,884
  • 15
  • 23
  • Nice explanation. Is there an argument in cov to handle that or should one create a custom function for it? – Sotos Jun 19 '17 at 19:50
  • @Consistency many thanks for your reply, definitely helps to shed light on the issue! I haven't marked it as accepted answer yet as I've two questions in reply to your answer: 1. Is there a 'correct'/standard way to calculate the pooled covariance matrix? E.g. what is used in mahalanobis(). In [this](https://stackoverflow.com/a/20237947/5731401) post for example it's calculated as: `((n1/n3)*cov(g1)) + ((n2/n3)*cov(g2))` with `n3=n1+n2`. 2. Why do you get `1.989278` as a result and not `1.724`? – raumkundschafter Jun 19 '17 at 19:55
  • @Sotos It seems that R's `cov` and `var` all gives unbiased estimate for the covariance (use n - 1 as denominator instead of n in this case), there is no argument to let it give MLE estimate. If one keeps using MLE estimate a lot, maybe one should create a custom function for it. – Consistency Jun 19 '17 at 19:57
  • Yeah, thought so. I have implemented mahalanobis for an anomaly detection method I built which was implemented into a commercial tool and works fine. No need for MLEs. Was just curious. Cheers – Sotos Jun 19 '17 at 19:59
  • @raumkundschafter Unbiased estimator for pooled covariance matrix like your original R code is most commonly used as far as I know. If you want a standard way, just use this one, though you cannot say the MLE one is incorrect. – Consistency Jun 19 '17 at 20:03
  • @raumkundschafter And the way `((n1/n3)*cov(g1)) + ((n2/n3)*cov(g2))` is not correct, it doesn't have any statistical meaning. It is just some kind of weighted average of sample covariance, not pooled covariance. – Consistency Jun 19 '17 at 20:08
  • @Consistency many thanks again for nice and detailed clarification! Thus using the unbiased estimator for the pooled covariance results in the `1.989278` and the MLE estimate returns `1.724`. – raumkundschafter Jun 19 '17 at 20:40
  • @raumkundschafter You are welcome! MLE estimate should result 1.989 and unbiased estimate should result 1.724, which should be the standard result. – Consistency Jun 19 '17 at 20:47
  • @Consistency Indeed, got it switched around! Blame it on the late time and heat in these latitudes ;) – raumkundschafter Jun 19 '17 at 21:13