
I am implementing a k-means clustering algorithm. It works using Euclidean distances, but when I swap the Euclidean distance out for the Mahalanobis distance it no longer clusters correctly.

For some reason, the Mahalanobis distance comes out negative at times. It turns out the covariance matrix has negative eigenvalues, which, as I understand it, a covariance matrix should never have (it should be positive semi-definite).
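
A check along these lines is how I see it (sigma being the matrix that covar_matrix(), defined below, returns for one of my centroids m and the data):

import numpy as np

sigma = covar_matrix(m, data)       # matrix built for one centroid m (functions are below)
print(np.linalg.eigvalsh(sigma))    # eigvalsh works since sigma is symmetric; some values come out negative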

Here are the functions I'm using:

import numpy as np

# takes in data point x, centroid m, covariance matrix sigma
def mahalanobis(x, m, sigma):
    return np.dot(np.dot(np.transpose(x - m), np.linalg.inv(sigma)), x - m)

# takes in centroid m and data (iris in 2-D, dimensions: 2x150)
def covar_matrix(m, data):
    d, n = data.shape
    R = np.zeros((d, d))
    for i in range(n):
        R += np.dot(data[:, i:i+1], np.transpose(data[:, i:i+1]))
    R /= n
    # autocorrelation matrix minus centroid*centroid'
    return R - np.dot(m, np.transpose(m))
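
For reference, the shapes I pass in look like this (the random array here is only a stand-in for the 2-D iris data):

import numpy as np

data = np.random.rand(2, 150)               # stand-in for iris reduced to 2 dimensions (2x150)
m = data[:, 0:1]                            # one centroid, kept as a 2x1 column vector
sigma = covar_matrix(m, data)               # 2x2 matrix returned by the function above
dist = mahalanobis(data[:, 1:2], m, sigma)  # distance of one point to this centroid (a 1x1 array)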

How I implemented the algorithm (a rough sketch in code follows the steps):

  1. Set k

  2. Randomly choose k centroids

  3. Calculate covar_matrix() for each centroid

  4. Calculate mahalanobis() from each data point to each centroid and assign the point to the closest cluster

  5. Look for new centroids: for each data point* in each cluster, calculate the sum of mahalanobis() to every other point in the cluster; the point with the minimum sum becomes the new centroid

  6. Repeat 3-5 until the old and new centroids are the same

*covar_matrix() is recalculated with this candidate point as the centroid
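
In code, the loop looks roughly like this (a simplified sketch of the steps above, not my exact code; the name cluster(), max_iter, and the random initialization are details made up for this post; it reuses mahalanobis() and covar_matrix() from above):

import numpy as np

def cluster(data, k, max_iter=100):
    d, n = data.shape

    # step 2: randomly choose k centroids from the data points
    rng = np.random.default_rng()
    centroids = [data[:, i:i+1] for i in rng.choice(n, size=k, replace=False)]

    for _ in range(max_iter):
        # step 3: covariance matrix for each centroid
        sigmas = [covar_matrix(m, data) for m in centroids]

        # step 4: assign each point to the cluster whose centroid is closest by mahalanobis()
        clusters = [[] for _ in range(k)]
        for i in range(n):
            x = data[:, i:i+1]
            dists = [mahalanobis(x, centroids[j], sigmas[j]) for j in range(k)]
            clusters[int(np.argmin(dists))].append(i)

        # step 5: the new centroid of a cluster is the member with the smallest
        # summed mahalanobis() to the other members, with covar_matrix()
        # recomputed using the candidate as the centroid (see the footnote)
        new_centroids = []
        for j in range(k):
            best_point, best_sum = centroids[j], np.inf
            for i in clusters[j]:
                cand = data[:, i:i+1]
                sigma = covar_matrix(cand, data)
                total = sum(mahalanobis(data[:, l:l+1], cand, sigma)
                            for l in clusters[j] if l != i)
                if total < best_sum:
                    best_point, best_sum = cand, total
            new_centroids.append(best_point)

        # step 6: stop once the centroids no longer change
        if all(np.array_equal(a, b) for a, b in zip(centroids, new_centroids)):
            break
        centroids = new_centroids

    return centroids, clusters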

I expect a non-negative Mahalanobis distance and a positive definite covariance matrix (fixing the latter will, I hope, fix the former).
