
I am currently doing research using the ASJP Database, and I have a 30 x 30 matrix of pairwise distances between 30 languages. I would like to perform K-Means clustering on these languages.

I passed the distance matrix to sklearn's K-Means Clustering and got results that made sense.

But I've read that K-Means clustering can't work with a distance matrix like this. If that is the case, why am I getting clusters that make sense (i.e., languages that are close to each other end up in the same cluster)? Am I getting wrong results that just look right?

I tried reducing the dimensionality of my dataset using Classical (Metric) Multidimensional Scaling (CMDS), but when I did, the resulting clusters became weird and didn't make much sense.
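For reference, the pipeline I tried looks roughly like the sketch below (the distance matrix is a random placeholder standing in for my 30 x 30 ASJP matrix, and the classical MDS step is written out by hand rather than taken from a library):

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder distance matrix so the snippet runs on its own;
# in my case this is the 30 x 30 ASJP distance matrix.
rng = np.random.default_rng(0)
points = rng.random((30, 5))
dist_matrix = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

def classical_mds(D, n_components=2):
    """Classical (Torgerson) MDS: double-centre the squared distances
    and take the top eigenvectors as coordinates."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n             # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                     # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)            # eigenvalues in ascending order
    idx = np.argsort(eigvals)[::-1][:n_components]  # keep the largest ones
    return eigvecs[:, idx] * np.sqrt(np.maximum(eigvals[idx], 0.0))

coords = classical_mds(dist_matrix, n_components=2)
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(coords)
print(labels)
```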


1 Answer


K-means needs the actual data points because, at every iteration, it computes two things: the centroid of each cluster, and the Euclidean distance between every point and those centroids.
Therefore, a distance matrix cannot be used directly, and the results are probably not meaningful: scikit-learn's KMeans simply treats each row of your 30 x 30 matrix as a 30-dimensional feature vector, which can happen to produce plausible-looking clusters even though it is not clustering on the distances themselves.

k-medoids is the k-means variation that can fix your issue. The trick behind it is that it uses medoids (cluster centres that are actual data points) instead of centroids (which might not coincide with any data point), so it only needs the pairwise distances between points.
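For instance, a minimal sketch using the third-party scikit-learn-extra package might look like this (the distance matrix below is just a random placeholder, so swap in your 30 x 30 matrix and choose n_clusters for your data):

```python
import numpy as np
from sklearn_extra.cluster import KMedoids

# Placeholder symmetric distance matrix so the snippet runs on its own;
# replace it with your 30 x 30 language-distance matrix.
rng = np.random.default_rng(0)
points = rng.random((30, 5))
dist_matrix = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

# metric="precomputed" makes KMedoids treat the input as pairwise distances,
# so every cluster centre (medoid) is one of the original languages.
kmedoids = KMedoids(n_clusters=5, metric="precomputed", random_state=0)
labels = kmedoids.fit_predict(dist_matrix)

print(labels)                    # cluster assignment for each language
print(kmedoids.medoid_indices_)  # which languages were chosen as medoids
```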

This question on Cross Validated helped me with a similar issue.
