I've been asked to calculate the average distance of each point to its centroid. The data set and number of clusters have been provided, and this seems like a very straightforward question (given what k-means clustering does), but I can't seem to find a workable solution.

The dataset is a 3-column, 500-row Excel worksheet of floating-point numbers.

From what I've read, the easiest way to do this is to put the distances from the points to each centroid in a numpy array and calculate the average. This is what I've done below.

from sklearn.cluster import KMeans
import pandas as pd
import matplotlib.pyplot as plt
'exec(%matplotlib inline)'
import numpy as np

df = pd.read_excel('k-means_test.xlsx', sheet_name='data_set')
X = np.array(df)
plt.scatter(X[:,0],X[:,1], label = 'True Position')

kmeans = KMeans(n_clusters=5)
kmeans.fit(X)

## print(kmeans.cluster_centers_)

plt.scatter(X[:,0],X[:,1], c=kmeans.labels_, cmap='rainbow')
## plt.show()

distances = kmeans.fit_transform(X)
variance = 0
i = 0
for label in kmeans.labels_:
    variance = variance + distances[i][label]
    i = i + 1

mean_distance = np.mean(distances)
print(mean_distance)

I was expecting a value between 1.41 and 2.85, but I'm getting 11.3. Pretty far off.

Any help would be greatly appreciated. I'm pretty new to python and machine learning algorithms in general.

Brian F.
  • Have a look at: https://stackoverflow.com/questions/40828929/sklearn-mean-distance-from-centroid-of-each-cluster – Maximilian Peters Jul 12 '19 at 05:53
  • Thanks @MaximilianPeters, I tried to incorporate some of this code into my example above, but things got out of hand (beginner here and to me some of the code in the example was complex) - so I went down the numpy mean approach. I'll take another look and see if I can derive something. – Brian F. Jul 12 '19 at 15:15

1 Answer

K-means uses squared Euclidean distances.

People often mistakenly assume this means minimizing Euclidean distance; it doesn't.

Anyway, try inserting distances = numpy.sqrt(distances) and your mean will likely be below 3 afterwards.

Has QUIT--Anony-Mousse
  • Hi @Anony-Mousse, thanks for the comment - conceptually very helpful. Unfortunately this resulted in 3.37 - so I must have really messed up somewhere. Perhaps there's an issue with trying to use Euclidean distances? – Brian F. Jul 12 '19 at 15:10
  • Well, you are computing the mean distance to all three centers... – Has QUIT--Anony-Mousse Jul 13 '19 at 01:20
  • Hmm... not sure how that should impact the point that my answer still needs to be between 1.41 and 2.85 for the mean distance of all points to their respective centers. And there are 5 clusters, not 3. Are you suggesting the code above is calculating the mean distance for all points to all 5 centers? – Brian F. Jul 15 '19 at 14:55
  • Yes. To all centers for every point. Check the shape of `distances`. – Has QUIT--Anony-Mousse Jul 16 '19 at 01:25
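
To make the last comment concrete, here is a minimal sketch of the difference between averaging the whole `distances` array and averaging only each point's distance to its own centroid. The random data is a stand-in (an assumption) for the Excel sheet in the question, and `n_init` and `random_state` are set only to make the run repeatable. As far as I know, `fit_transform` already returns plain Euclidean distances, so no square root is applied here.

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in data (assumption): 500 rows x 3 columns of floats, used in place
# of the Excel sheet from the question.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3)) * 3.0

kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)

# One row per point, one column per centroid: shape (500, 5).
distances = kmeans.fit_transform(X)

# Keep only the distance from each point to the centroid of its own cluster.
own_distances = distances[np.arange(len(X)), kmeans.labels_]

mean_all = distances.mean()      # what np.mean(distances) in the question computes
mean_own = own_distances.mean()  # average distance of each point to its centroid

print(distances.shape)
print(mean_all, mean_own)
```

In the question's loop, `variance` accumulates the same per-point distances, so `variance / len(X)` should give the same number as `mean_own`; the printed `np.mean(distances)` instead averages over all five centroids for every point.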