3

Why would one use the k-medoids algorithm rather than k-means? Is it only that the set of metrics that can be used with k-means is very limited, or is there something more?

Is there an example of data for which it makes much more sense to choose the best cluster representatives from the data itself rather than from R^n?

Antoine
  • 862
  • 7
  • 22
  • I'm not sure what you understand by metric in k-means, can you explain it a little bit more? – Kazenga Mar 01 '14 at 18:10
  • A metric is a function that measures the distance between instances and satisfies some additional conditions (see http://en.wikipedia.org/wiki/Metric_%28mathematics%29 under Definition). The standard version of k-means uses the Euclidean metric/distance, for example. – Antoine Mar 07 '14 at 06:24
  • Well, you can use all those metrics in k-means as well as in k-medoids, so there is no difference between them (answer to your second question). – Kazenga Mar 07 '14 at 14:23
  • @Kazenga No, you cannot, or at least not easily. Computing the means of the current clusters in k-means minimizes the objective function for the Euclidean distance. If the Manhattan metric is used instead, the medians should be computed. If some other metric is used, the minimizing point might not be so easily found. – Antoine Sep 19 '22 at 08:07

4 Answers

4

The problem with k-means is that it is not interpretable. By interpretability I mean that the model should also be able to explain why it produced a certain output. Let's take an example. Suppose there is a food-review dataset with two possibilities, a positive review or a negative review, so we can say k = 2, where k is the number of clusters.

Now, in k-means the third step of the algorithm is the update step, where you recompute the k centroids as the mean of the points that lie in each cluster. The example we have chosen is a text problem, so you would also apply some kind of text feature-vectorization scheme like bag-of-words (BoW) or word2vec, and for every review you would get a corresponding vector. The centroid c_i that you get after running k-means is then the mean of the vectors present in that cluster, and with that centroid you cannot interpret much, or rather, I should say, nothing.

But if for the same problem you apply k-medoids, you choose your k centroids/medoids from the dataset itself. Let's say you choose the point x_5 from your dataset as the first medoid. Your interpretability now increases, because the medoid/centroid is itself a review. So in k-medoids you choose the centroids from your dataset itself; this is the foremost motivation for introducing k-medoids.
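
To make this concrete, here is a minimal sketch, assuming scikit-learn is available. The toy reviews are made up, and the medoid is picked by hand inside each k-means cluster purely for illustration, rather than by a true k-medoids optimization:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import pairwise_distances

reviews = [
    "the pizza was great and the service was friendly",
    "great food and great service, will come again",
    "the soup was cold and the waiter was rude",
    "terrible food and rude staff, never again",
]
X = CountVectorizer().fit_transform(reviews).toarray()  # bag-of-words vectors

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_[0])  # a mean of word-count vectors: hard to read as a review

# For each cluster, the medoid is the member review with the smallest total
# distance to the other members, i.e. an actual, readable review.
for label in range(2):
    idx = np.flatnonzero(km.labels_ == label)
    D = pairwise_distances(X[idx])
    medoid = idx[D.sum(axis=1).argmin()]
    print(f"cluster {label} medoid: {reviews[medoid]!r}")

The printed centroid is a dense vector of fractional word counts, while each medoid is one of the original reviews, which you can read directly.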

Coming to the metrics part: you can apply all the metrics that you would apply for k-means.

Hope this helps.

Aditya
  • 950
  • 8
  • 37
  • 1
    This is a nice aspect I haven't thought of, great! Still, you could generate the text back I think (and maybe get typical words that the "central" review contains) but that would mean additional work. – Antoine Jun 06 '19 at 14:49
  • 1
    @Antoine Yes, you could get them back, but just interpreting them is very tough; and yes, as you said, the additional work would be cumbersome. – Aditya Jun 06 '19 at 15:02
1

The K-Means algorithm uses a distance function such as the Euclidean or Manhattan distance, which is computed over vector-based instances. The K-Medoids algorithm instead only needs a more general (and less constrained) pairwise distance function defined between any two instances. This distinction works well in contexts like complex data types or relational rows, where the instances have a high number of dimensions.

High dimensionality problem

In standard clustering libraries and in the k-means algorithm, the distance-computation phase can spend a lot of time scanning the entire vector of attributes belonging to an instance. For instance, in document clustering with the standard TF-IDF representation, computing the cosine similarity means scanning all the possible words that appear in the whole collection of documents, which in many cases can amount to millions of entries. This is why, in this domain, some authors [1] suggest restricting the words considered to a subset of the N most frequent words of the language.

Using K-Medoids there is no need to represent and store the documents as vectors of word frequencies. As an alternative representation for the documents, it is possible to use the set of words appearing at least twice in each document, and as a distance measure the Jaccard distance can be used. With the vector representation, by contrast, each instance is as long as the number of words in your dictionary.
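
As a rough sketch of that idea: the toy documents, the word_set helper and the jaccard_distance function below are only illustrative, not from [1], and the KMedoids call in the final comment assumes the scikit-learn-extra package with its support for precomputed distance matrices:

from collections import Counter

docs = [
    "the cat sat on the mat while the cat slept",
    "the dog chased the cat and the dog barked",
    "stocks fell as markets reacted and markets closed as rates rose and rates stayed high",
]

def word_set(doc):
    # Words appearing at least twice in the document
    counts = Counter(doc.lower().split())
    return {w for w, c in counts.items() if c >= 2}

def jaccard_distance(a, b):
    union = a | b
    if not union:
        return 1.0  # convention for two empty word sets
    return 1.0 - len(a & b) / len(union)

sets = [word_set(d) for d in docs]
D = [[jaccard_distance(s, t) for t in sets] for s in sets]  # pairwise distance matrix
for row in D:
    print([round(x, 2) for x in row])

# The matrix D can then be handed to any k-medoids implementation that accepts
# precomputed distances, e.g. (assuming scikit-learn-extra is installed):
#   sklearn_extra.cluster.KMedoids(n_clusters=2, metric="precomputed").fit(np.array(D))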

Heterogeneity and complex data types

There are many domains where it is considerably better to abstract away the concrete representation of an instance:

  • Clustering of graph nodes;
  • Car-driving behaviour, represented as GPS routes.

Complex data types allow the design of ad-hoc distance measures that fit the data domain better.
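
For example, here is a hypothetical ad-hoc dissimilarity between two GPS routes of different lengths; the measure below (a symmetric mean closest-point distance) is made up for illustration, and real systems might instead use something like dynamic time warping or the Hausdorff distance. Any such pairwise function is enough for k-medoids:

import numpy as np

# Hypothetical routes: each is an array of (lat, lon) points of arbitrary length
route_a = np.array([[48.85, 2.35], [48.86, 2.36], [48.87, 2.37]])
route_b = np.array([[48.85, 2.34], [48.86, 2.35], [48.88, 2.36], [48.89, 2.37]])

def route_distance(r1, r2):
    # Symmetric mean closest-point distance between the two routes
    d = np.linalg.norm(r1[:, None, :] - r2[None, :, :], axis=-1)  # all pairwise point distances
    return 0.5 * (d.min(axis=1).mean() + d.min(axis=0).mean())

print(route_distance(route_a, route_b))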

[1] Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. 2008. Introduction to Information Retrieval. Cambridge University Press, New York, NY, USA.

Source: https://github.com/eracle/Gap

  • I believe that you are not answering the questions. K-means and k-medoids both suffer from the high-dimensionality problem ... – Antoine Mar 07 '14 at 06:30
1

The difference is that in k-means the centroids (cluster centers) are computed as the average of the vectors contained in the cluster, whereas in k-medoids the medoid (cluster center) is an actual record from the dataset, roughly the member closest to the cluster's center. So if you need to represent the cluster center by a record of your data, use k-medoids; otherwise I would use k-means (the concepts of the two algorithms are the same).
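
A tiny numeric sketch of that difference (the values are arbitrary, chosen only for illustration):

import numpy as np

# One small cluster with an outlier
cluster = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [10.0, 10.0]])

centroid = cluster.mean(axis=0)  # k-means style center: the average, not a data point
dists = np.linalg.norm(cluster[:, None] - cluster[None, :], axis=-1)
medoid = cluster[dists.sum(axis=1).argmin()]  # k-medoids style center: an actual record

print("centroid:", centroid)  # [2.75 2.75], pulled towards the outlier
print("medoid:  ", medoid)    # one of the original points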

Kazenga
  • 476
  • 1
  • 4
  • 15
1

Why would we use k-medoids instead of k-means in case of (squared) Euclidean distance?

1. Technical justification

In the case of relatively small data sets (as the complexity of k-medoids is greater): to obtain a clustering that is more robust to noise and outliers.

Example 2D data showing this:

[Figure]
The graph on the left shows clusters obtained with K-medoids (the sklearn_extra.cluster.KMedoids method in Python with default options) and the one on the right with K-means, for K=2. Blue crosses are the cluster centers.

The Python code used to generate green points:

import numpy as np
import matplotlib.pyplot as plt

# Four groups of points with different sizes and spreads
rng = np.random.default_rng(seed=32)
a = rng.random((6, 2)) * 2.35 - 3 * np.ones((6, 2))       # 6 widely scattered points
b = rng.random((50, 2)) * 0.25 - 2 * np.ones((50, 2))     # 50 tightly packed points
c = rng.random((100, 2)) * 0.5 - 1.5 * np.ones((100, 2))  # 100 moderately spread points
d = rng.random((7, 2)) * 0.55                             # 7 points near the origin

points = np.concatenate((a, b, c, d))
plt.plot(points[:, 0], points[:, 1], "g.", markersize=8, alpha=0.3)  # green points
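
For completeness, here is a possible sketch of the clustering step itself, assuming scikit-learn and the scikit-learn-extra package are installed; it reuses plt and points from the snippet above and plots both sets of centers on a single figure rather than the two panels shown above:

from sklearn.cluster import KMeans
from sklearn_extra.cluster import KMedoids

kmed = KMedoids(n_clusters=2, random_state=32).fit(points)
kmean = KMeans(n_clusters=2, n_init=10, random_state=32).fit(points)

for name, centers in [("K-medoids", kmed.cluster_centers_),
                      ("K-means", kmean.cluster_centers_)]:
    print(name, "centers:\n", centers)
    plt.plot(centers[:, 0], centers[:, 1], "bx", markersize=12)  # blue crosses

plt.show()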

2. Business case justification

Here are some example business cases showing why we would prefer k-medoids. They mostly come down to the interpretability of the results and the fact that in k-medoids the resulting cluster centers are members of the original dataset.

2.1 We have a recommender engine based only on user-item preference data and want to recommend to the user the items (e.g. movies) that other, similar people enjoyed. So we assign the user to his/her closest cluster and recommend the top movies that the cluster representative (an actual person) watched. If the cluster representative weren't an actual person, we wouldn't possess a history of actually watched movies to recommend; each time we would additionally have to search, e.g., for the closest actual person in the cluster. Example data: the classic MovieLens 1M Dataset.

2.2 We have a database of patients and want to pick a small representative group of size K to test a new drug on. After clustering the patients with K-medoids, the cluster representatives are invited to the drug trial.

hanna
  • 627
  • 9
  • 15