I know that HDMD
package has a function called pairwise.mahalanobis
that is supposed to calculate the pairwise Mahalanobis Distance. However, I also want to introduce weights to this distance and it is not feasible with this function. Therefore, I developed my own code. To test whether it functions well, I first kept it simple, i.e. with no weights, and compared its results to that of pairwise.mahalanobis
function. However the results did not match... Below is the function I use:
dist.maha <- function (X) {
diff = pair.diff(X) # pairwise difference of rows
V <- cov(X) ## empirical covariance; positive definite
L <- t(chol(V)) ## lower triangular factor
stdX <- t(forwardsolve(L, t(diff))) # solving the system of linear equations
return(stdX)
}
And this is its implementation of both alternatives on a toy data:
data = as.matrix(c(100, 54, 56, 79, 12))
dist_manuel = dist.maha(data)
# This is to convert dist_manuel from a vector to a distance matrix
ind_1 = vector(length = choose(nrow(data),2))
ind_2 = vector(length = choose(nrow(data),2))
k =1
for (j in 1:(nrow(data)-1)){
for(i in (j+1):nrow(data)){
ind_1[k] = i
ind_2[k] = j
k = k + 1
}
}
dist_manuel = cbind(ind_1,ind_2,dist_manuel)
dist_mat = matrix(data = NA, nrow = nrow(data), ncol = nrow(data))
for (j in 1:(nrow(data)-1)){
for(i in (j+1):nrow(data)){
dist_mat[i,j] = dist_manuel[which(dist_manuel[,1] == i & dist_manuel[,2] == j),3]
}
}
# This is the HDMD alternative
id = c(1,2,3,4,5)
data = cbind(id,data)
HDMD = pairwise.mahalanobis(as.data.frame(data[,2]), grouping = data[,1])
dist_HDMD = HDMD$distance
# The outputs
dist_HDMD
# [,1] [,2] [,3] [,4] [,5]
#[1,] 0 1 4 9 16
#[2,] 1 0 1 4 9
#[3,] 4 1 0 1 4
#[4,] 9 4 1 0 1
#[5,] 16 9 4 1 0
dist_mat
# [,1] [,2] [,3] [,4] [,5]
#[1,] NA NA NA NA NA
#[2,] 1.4002541 NA NA NA NA
#[3,] 1.3393735 -0.06088061 NA NA NA
#[4,] 0.6392465 -0.76100768 -0.7001271 NA NA
#[5,] 2.6787470 1.27849290 1.3393735 2.039501 NA
The results of pairwise.mahalanobi
s function seems completely absurd to me. For starters, it assigns a distance of 1 for both data[2]
& data[3]
and data[2]
& data[1]
which makes no sense when one looks at their values. My function, on the other hand, gives consistent results. For instance, let's compare the ratio of distances between data[1]
& data[2]
and data[1]
& data[3]
.
(100 - 54) / (100 - 56) = 46/44 = 1.045455
Now, this ratio should hold for the distances my function produces as well.
dist_mat[2,1]/dist_mat[3,1]
#[1] 1.045455
And it does! Does that mean that my function works well while the pairwise.mahalanobis
is erroneous ? ( or am I using it incorrectly somehow?) I am not very experienced in R, so I couldn't dare to come to this conclusion by myself. It would be great if someone more experienced than me could confirm my logic.