
I know that the HDMD package has a function called pairwise.mahalanobis that is supposed to calculate pairwise Mahalanobis distances. However, I also want to introduce weights to this distance, and that is not possible with this function. Therefore, I wrote my own code. To test that it works, I first kept it simple, i.e. with no weights, and compared its results to those of the pairwise.mahalanobis function. However, the results did not match... Below is the function I use:

dist.maha <- function (X) {
  diff = pair.diff(X) # pairwise difference of rows 
  V <- cov(X)  ## empirical covariance; positive definite
  L <- t(chol(V))  ## lower triangular factor
  stdX <- t(forwardsolve(L, t(diff)))  # solving the system of linear equations
  return(stdX)
}

And here is how both alternatives are applied to toy data:

data = as.matrix(c(100, 54, 56, 79, 12))
dist_manuel = dist.maha(data)

# This is to convert dist_manuel from a vector to a distance matrix 

ind_1 = vector(length = choose(nrow(data),2))
ind_2 = vector(length = choose(nrow(data),2))
k =1  
for (j in 1:(nrow(data)-1)){
  for(i in (j+1):nrow(data)){
    ind_1[k] = i
    ind_2[k] = j    
    k = k + 1
  }
}
dist_manuel = cbind(ind_1,ind_2,dist_manuel)
dist_mat = matrix(data = NA, nrow = nrow(data), ncol = nrow(data))

for (j in 1:(nrow(data)-1)){
  for(i in (j+1):nrow(data)){
    dist_mat[i,j] = dist_manuel[which(dist_manuel[,1] == i & dist_manuel[,2] == j),3]
  }
}

# This is the HDMD alternative 

id = c(1,2,3,4,5)
data = cbind(id,data) 
HDMD = pairwise.mahalanobis(as.data.frame(data[,2]), grouping = data[,1])
dist_HDMD = HDMD$distance

# The outputs

dist_HDMD

#     [,1] [,2] [,3] [,4] [,5]
#[1,]    0    1    4    9   16
#[2,]    1    0    1    4    9
#[3,]    4    1    0    1    4
#[4,]    9    4    1    0    1
#[5,]   16    9    4    1    0

dist_mat

#          [,1]        [,2]       [,3]     [,4] [,5]
#[1,]        NA          NA         NA       NA   NA
#[2,] 1.4002541          NA         NA       NA   NA
#[3,] 1.3393735 -0.06088061         NA       NA   NA
#[4,] 0.6392465 -0.76100768 -0.7001271       NA   NA
#[5,] 2.6787470  1.27849290  1.3393735 2.039501   NA

The results of the pairwise.mahalanobis function seem completely absurd to me. For starters, it assigns a distance of 1 both to data[2] & data[3] and to data[2] & data[1], which makes no sense when one looks at their values. My function, on the other hand, gives consistent results. For instance, let's compare the ratio of the distances between data[1] & data[2] and between data[1] & data[3].

(100 - 54) / (100 - 56) = 46/44 = 1.045455

Now, this ratio should hold for the distances my function produces as well, since in one dimension the Mahalanobis distance is just the absolute difference scaled by a constant (the standard deviation), which cancels in the ratio.

dist_mat[2,1]/dist_mat[3,1]
#[1] 1.045455

And it does! Does that mean that my function works well while pairwise.mahalanobis is erroneous? (Or am I using it incorrectly somehow?) I am not very experienced in R, so I didn't dare to come to this conclusion by myself. It would be great if someone more experienced than me could confirm my logic.

  • Where does the `pair.diff` function come from? Is it `ICSNP`? – kluu Mar 15 '19 at 12:27
  • Yes it comes from ICSNP – Elif Cansu Akoğuz Mar 15 '19 at 12:33
  • Look at https://stats.stackexchange.com/questions/65705/pairwise-mahalanobis-distances – abhiieor Mar 22 '19 at 14:08
  • Not knowing what the function is supposed to do, you might try comparing to the source of the HDMD function and deciding for yourself whether you think it's doing the right thing: https://rdrr.io/cran/HDMD/src/R/HDMD_package.R#sym-pairwise.mahalanobis – Calum You Mar 22 '19 at 20:13
  • Your code throws an error for me: `Error in dist_mat[i, j] <- dist_manuel[which(dist_manuel[, 1] == i & dist_manuel[, : number of items to replace is not a multiple of replacement length` – duckmayr Mar 23 '19 at 14:38

1 Answer


There is an error in your dist.maha function. This is obvious straight away because some of the "distances" it computes are negative numbers, so they cannot be actual distances! Fortunately, this is quite easy to fix: you need to square the entries of stdX and then sum them across dimensions.

library("HDMD")
library("tidyverse")

# Convert a vector of pairwise distances to a distance matrix
# (Simplified approach which doesn't use for-loops)
pairwise_dist_to_dist_matrix <- function(dist, n) {
  stopifnot(length(dist) == n*(n-1)/2)
  mat <- matrix(NA_real_, n, n)
  diag(mat) <- 0
  mat[lower.tri(mat)] <- dist
  mat
}

dist.maha <- function (X) {
  diff <- pair.diff(X)                 # pairwise difference of rows
  V <- cov(X)                          # empirical covariance; positive definite
  L <- t(chol(V))                      # lower triangular factor
  stdX <- t(forwardsolve(L, t(diff)))  # solving the system of linear equations
  dist <- stdX * stdX                  # Don't forget to square!
  dist <- rowSums(dist)                # And add up the differences in each dimension.
  pairwise_dist_to_dist_matrix(dist, nrow(X))
}

# An alternative computation, for an additional check:
# it evaluates d_ij = (x_i - x_j)' V^-1 (x_i - x_j) row by row
dist.maha2 <- function(X) {
  diff <- pair.diff(X)                   # pairwise differences of rows
  V <- cov(X)                            # empirical covariance
  Vinv <- solve(V)                       # its inverse
  dist <- rowSums(diff %*% Vinv * diff)  # quadratic form for each pair
  pairwise_dist_to_dist_matrix(dist, nrow(X))
}


# Slightly more complex data matrix to check if
# functions work in higher dimensions
data <- matrix(c(100, 54, 56, 79, 12, 1, 2, 3, 4, 5), ncol = 2)
dist.maha(data)
#>          [,1]      [,2]     [,3]     [,4] [,5]
#> [1,] 0.000000        NA       NA       NA   NA
#> [2,] 2.275210 0.0000000       NA       NA   NA
#> [3,] 1.974017 0.9742819 0.000000       NA   NA
#> [4,] 4.759700 7.5842687 3.250906 0.000000   NA
#> [5,] 7.896067 3.6213875 1.974017 5.690146    0
dist.maha2(data)
#>          [,1]      [,2]     [,3]     [,4] [,5]
#> [1,] 0.000000        NA       NA       NA   NA
#> [2,] 2.275210 0.0000000       NA       NA   NA
#> [3,] 1.974017 0.9742819 0.000000       NA   NA
#> [4,] 4.759700 7.5842687 3.250906 0.000000   NA
#> [5,] 7.896067 3.6213875 1.974017 5.690146    0
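
For yet another sanity check, the same numbers can be reproduced with base R's stats::mahalanobis(), which returns the squared Mahalanobis distance of each row of x from a fixed center. This is a minimal sketch; the outer()/Vectorize() wrapper is just one convenient way to tabulate all pairs:

# Cross-check against stats::mahalanobis(); entry (i, j) is the squared
# Mahalanobis distance between rows i and j of the data matrix.
V <- cov(data)
all_pairs <- outer(
  seq_len(nrow(data)), seq_len(nrow(data)),
  Vectorize(function(i, j) mahalanobis(data[i, , drop = FALSE], data[j, ], V))
)
all_pairs  # should match dist.maha(data) on the lower triangle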

It also seems that you don't use pairwise.mahalanobis correctly. You have to compute and pass the covariance matrix (the cov argument).

# This is the HDMD alternative
id = c(1,2,3,4,5)

# Incorrect:

# You have to specify the `cov` argument.
# Otherwise `pairwise.mahalanobis` doesn't compute it correctly
# as each sample is assumed to be in its own group.

pairwise.mahalanobis(data, grouping = id)$distance
#>           [,1]      [,2]      [,3]     [,4]      [,5]
#> [1,]     0.000 4345.2805 3840.7349  759.689 15362.940
#> [2,]  4345.280    0.0000   16.7591 1487.209  3369.708
#> [3,]  3840.735   16.7591    0.0000 1194.197  3840.735
#> [4,]   759.689 1487.2090 1194.1967    0.000  9310.174
#> [5,] 15362.940 3369.7076 3840.7349 9310.174     0.000

# Correct:

# NOTE: Ignore the warning message; there seems to be a small bug in `pairwise.mahalanobis`.
pairwise.mahalanobis(data, grouping = id, cov = cov(data))$distance
#> Warning in if (dim(cov) != c(p, p)) stop("cov matrix not of dim = (p,p)
#> \n"): the condition has length > 1 and only the first element will be used
#>          [,1]      [,2]      [,3]     [,4]     [,5]
#> [1,] 0.000000 2.2752099 1.9740168 4.759700 7.896067
#> [2,] 2.275210 0.0000000 0.9742819 7.584269 3.621388
#> [3,] 1.974017 0.9742819 0.0000000 3.250906 1.974017
#> [4,] 4.759700 7.5842687 3.2509059 0.000000 5.690146
#> [5,] 7.896067 3.6213875 1.9740168 5.690146 0.000000
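
A quick programmatic check (a sketch, comparing only the lower triangles since dist.maha leaves the upper triangle as NA):

m1 <- dist.maha(data)
m2 <- pairwise.mahalanobis(data, grouping = id, cov = cov(data))$distance
all.equal(m1[lower.tri(m1)], m2[lower.tri(m2)])  # should be TRUE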

Created on 2019-03-24 by the reprex package (v0.2.1)
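
Finally, since the original motivation was to introduce weights: the question doesn't pin down a weighting scheme, so the following is only a sketch of one common choice, giving each dimension a positive weight w_k by rescaling the pairwise differences, i.e. using the quadratic form W^(1/2) V^(-1) W^(1/2) where W = diag(w):

# Weighted pairwise (squared) Mahalanobis distances; w is a vector of
# positive per-dimension weights. With w = rep(1, ncol(X)) this reduces
# to dist.maha2 above. The weighting scheme itself is an assumption.
dist.maha.weighted <- function(X, w) {
  stopifnot(length(w) == ncol(X), all(w > 0))
  diff <- pair.diff(X)
  Vinv <- solve(cov(X))
  Wh <- diag(sqrt(w), ncol(X))          # diagonal square-root weight matrix
  M <- Wh %*% Vinv %*% Wh               # weighted inverse covariance
  dist <- rowSums(diff %*% M * diff)    # quadratic form for each pair
  pairwise_dist_to_dist_matrix(dist, nrow(X))
}

dist.maha.weighted(data, w = c(1, 1))   # identical to dist.maha2(data)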

dipetkov
  • Can you tell me why the distance between row 3 and row 1 (1.974) is less than the distance between row 3 and row 4 (3.251)? Row 3 is equally far away from rows 1 and 4 in the second dimension, and it is closer to row 4 in the first dimension. Shouldn't its distance to row 4 be smaller? – Elif Cansu Akoğuz May 08 '19 at 08:39
  • Row 3 is closer to row 4 than row 1 in the second dimension. – dipetkov May 09 '19 at 18:33
  • That aside, I think you ask an interesting question. Comparing the differences in the two dimensions doesn't take into account the covariance matrix V or more precisely its inverse V^-1. For example, the diagonal entries of V^-1 "scale" each dimension in an appropriate way. In this case, V^-1[1,1] = 0.002 and V^-1[2,2] = 0.84, so a difference of 2 in the first dimension is less "important" than difference of 2 in the second dimension. – dipetkov May 09 '19 at 18:40
  • Actually I made a mistake while writing down my first comment; I wanted to compare the distances of 3 & 4 and 3 & 5. The differences between 3 and 5 are (44, 2) and between 3 and 4 are (23, 1) in absolute value, so 3 is farther from 5 in both dimensions. Yet we see that the Mahalanobis distance between 3 & 5 (1.974017) is smaller than that of 3 & 4 (3.250906). I don't understand how this can be the case, even considering your point about the covariance matrix... – Elif Cansu Akoğuz May 10 '19 at 13:32
  • Again, it is because of the covariance matrix V^-1. In this case, the two dimensions happen to be negatively correlated. You can check that `V[1,2] < 0` and so `Vinv[1,2] > 0` where `Vinv = solve(V)`. – dipetkov May 10 '19 at 19:09
  • Then you can check that the distance`c(1,1) %*% Vinv %*% c(1,1) ` is bigger than the distance`c(1,-1) %*% Vinv %*% c(1,-1) `, even though in absolute value the difference in both dimensions is `1`. – dipetkov May 10 '19 at 19:11
  • Now let's go back to the example. The difference between 3 & 4 is actually (-23,-1). The difference between 3 & 5 is (44,-2). Or (-44,2). What matters is that the signs are different. – dipetkov May 10 '19 at 19:13
  • So to summarize, the shape of the covariance matrix V is important. (Not just the diagonal values as my first comment suggested.) Comparing the absolute values of the differences in each dimension basically boils down to using Euclidean distance: `(x-y)' * (x-y)`. The sketch below makes this concrete. – dipetkov May 10 '19 at 19:19
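
The numerical claims in these comments are easy to verify directly. A minimal sketch, reusing the toy data and functions from the answer above (the variable names d34, d35 are just for illustration):

# Because V[1,2] < 0, we have Vinv[1,2] > 0, so a difference vector whose
# components share the same sign yields a larger quadratic form than one
# whose components have opposite signs.
V <- cov(data)
Vinv <- solve(V)
c(1, 1) %*% Vinv %*% c(1, 1)    # larger
c(1, -1) %*% Vinv %*% c(1, -1)  # smaller

# The actual difference vectors discussed in the comments:
d34 <- data[3, ] - data[4, ]    # (-23, -1): components share the same sign
d35 <- data[3, ] - data[5, ]    # ( 44, -2): components have opposite signs
d34 %*% Vinv %*% d34            # 3.250906
d35 %*% Vinv %*% d35            # 1.974017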