I've been asked to calculate the average distance of each point to its centroid. The dataset and number of clusters have been provided, and this seems like a very straightforward question (given what k-means clustering does), but I can't seem to find a workable solution.
The dataset is a 3-column, 500-row Excel worksheet of floating-point numbers.
From what I've read, the easiest way to do this is by putting the distances from the points to each centroid in a NumPy array and calculating the average. This is what I've done below.
from sklearn.cluster import KMeans
import pandas as pd
import matplotlib.pyplot as plt
# %matplotlib inline  (Jupyter magic; only needed inside a notebook)
import numpy as np
df = pd.read_excel('k-means_test.xlsx', sheet_name='data_set')
X = np.array(df)
plt.scatter(X[:,0],X[:,1], label = 'True Position')
kmeans = KMeans(n_clusters=5)
kmeans.fit(X)
## print(kmeans.cluster_centers_)
plt.scatter(X[:,0],X[:,1], c=kmeans.labels_, cmap='rainbow')
## plt.show()
distances = kmeans.fit_transform(X)
variance = 0
i = 0
for label in kmeans.labels_:
    variance = variance + distances[i][label]
    i = i + 1
mean_distance = np.mean(distances)
print(mean_distance)
I was expecting a value between 1.41 and 2.85, but I'm getting 11.3. Pretty far off.
Any help would be greatly appreciated. I'm pretty new to Python and machine learning algorithms in general.
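For reference, here is a minimal sketch of the per-point calculation described above, run on a made-up toy distance matrix rather than the real worksheet. The helper name `mean_distance_to_own_centroid` and the toy values are assumptions for illustration only. The input is assumed to have the same shape as the array returned by `KMeans.transform(X)`: one row per point, one column per centroid.

```python
import numpy as np

def mean_distance_to_own_centroid(distances, labels):
    """Average, over all points, of the distance to the assigned centroid.

    distances: shape (n_points, n_clusters), e.g. the output of
               KMeans.transform(X) (assumed layout)
    labels:    shape (n_points,), cluster index assigned to each point
    """
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels, dtype=int)
    # Pick one entry per row: column labels[i] from row i.
    own = distances[np.arange(len(labels)), labels]
    return own.mean()

# Toy distance matrix (made-up values): 4 points, 2 clusters.
toy_distances = np.array([[1.0, 9.0],
                          [2.0, 8.0],
                          [7.0, 3.0],
                          [6.0, 2.0]])
toy_labels = np.array([0, 0, 1, 1])

own_mean = mean_distance_to_own_centroid(toy_distances, toy_labels)
all_mean = toy_distances.mean()  # what np.mean(distances) computes
print(own_mean, all_mean)
```

On the toy data the two figures differ (2.0 versus 4.75), because `np.mean` over the whole matrix also averages in each point's distance to the centroids it was not assigned to. If the same pattern applies here, passing `kmeans.transform(X)` and `kmeans.labels_` to the helper might give a figure closer to the expected range, though I have not verified that against the actual data.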