K-means clustering with pre-defined centroids

Question

I'm trying to run the K-means algorithm with predefined centroids. I have had a look at the following posts:

However, every time I run the command:

km = kmeans(df_std[,c(10:13)], centers = centroids)

I get the following error:

**Error: empty cluster: try a better set of initial centers**

I have defined the centroids as:

centroids = matrix(c(140.12774, 258.62615, 239.36800, 77.43235,
                      33.37736, 58.73077,  68.80000,  12.11765,
                     0.8937264, 0.8118462, 0.8380000, 0.8052941,
                     11.989858, 12.000000, 8.970000,  1.588235),
                     ncol = 4, byrow = TRUE)

My data is a subset of a data frame, say df_std, and it has already been scaled:

df_std[,c(10:13)]

I'm wondering why the system would give the above error. Any help would be highly appreciated!

Are you *sure* that is what you want? The clusters would move (if they don't become empty). You most likely want to do nearest neighbor *classification* instead of clustering... — Has QUIT--Anony-Mousse, Jul 12 '18 at 16:52
@Anony-Mousse Yes, I definitely want centroid based clustering! I am replicating some work which I did on one data set. And now, for the new data set I do not want unsupervised clustering, rather I want to extract the similar groups. — Sandy, Jul 12 '18 at 23:41
@Anony-Mousse I also see that many people, depending on their needs, have to use centroid-based clustering; please see: https://tolstoy.newcastle.edu.au/R/e9/help/10/01/0906.html — Sandy, Jul 13 '18 at 00:16
Do you want the centers to move, or not? At least one of these clusters is empty, and will disappear. — Has QUIT--Anony-Mousse, Jul 13 '18 at 06:28
To explain my problem in more detail. The earlier work that I did had let's say 4 clusters A,B,C and D. Clusters A and B were densely populated while C and D were sparse. This classification was based on a set containing eight features (X = 8). If I want similar distribution of my observations based on the identical eight features, shouldn't I use K-means with predefined centroids? — Sandy, Jul 13 '18 at 10:30
No. I would use the previous centroids and a one nearest neighbor *classifier*. Because you want to use the same centers, you don't want them to move to very different locations. — Has QUIT--Anony-Mousse, Jul 13 '18 at 17:59

score 4 · Answer 1 · answered Jul 13 '18 at 18:01

Use a nearest neighbor classifier using the centers only, do not recluster.

That means every point is labeled just as the nearest center. This is similar to k-means but you do not change the centers, you do not need to iterate, and every new data point can be processed independently and in any order. No problem arises when processing just a single point at a time (in your case, k-means failed because one cluster became empty!)
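
As a rough illustration of the labeling step described above, here is a minimal sketch in base R. The helper name assign_nearest_center and the toy data are made up for this example; they are not from the question. It labels each row with the index of its closest center under Euclidean distance, and never moves the centers.

```r
# Hypothetical helper: label each row of `x` with the index of its nearest center.
# The centers are used as given and are never updated.
assign_nearest_center <- function(x, centers) {
  x <- as.matrix(x)
  centers <- as.matrix(centers)
  stopifnot(ncol(x) == ncol(centers))
  # Squared Euclidean distance from every point to every center
  d2 <- sapply(seq_len(nrow(centers)), function(j) {
    rowSums(sweep(x, 2, centers[j, ], "-")^2)
  })
  d2 <- matrix(d2, nrow = nrow(x))  # keep an n x k matrix even for a single point
  # Index of the smallest distance in each row
  max.col(-d2, ties.method = "first")
}

# Toy data, invented for illustration only
pts <- rbind(c(0, 0), c(0.2, -0.1), c(5, 5), c(4.8, 5.3))
ctr <- rbind(c(0, 0), c(5, 5), c(100, 100))  # the third center attracts no points

labels <- assign_nearest_center(pts, ctr)
# Count points per center; a center with zero points is simply reported as 0
table(factor(labels, levels = seq_len(nrow(ctr))))
```

Note that the third center ends up with no points, and nothing fails: an empty group is just a count of zero here, whereas kmeans() stops with the empty cluster error.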

Sandy · Accepted Answer · 2018-07-12T23:38:25.377

While browsing for the specific error that I posted above:

Error: empty cluster: try a better set of initial centers

I found the following link to a conversation:

http://r.789695.n4.nabble.com/Empty-clusters-in-k-means-possible-solution-td4667114.html

Broadly speaking, this error is raised when at least one centroid ends up with no data points assigned to it, i.e. the centroids don't match the data.

It can happen when centers is given as a number (k): because of the random starts of the k-means algorithm, the initial centres may not match the data.

It may also happen when centers is given as a matrix of centroids (my case). The problem was that my data was scaled but my centroids were not.
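
For anyone hitting the same mismatch, here is a minimal sketch of one way to put the centroids on the same scale as the data. It assumes the data was standardized with scale(), which I did not state above, so treat that as an assumption. The names raw and raw_centers and all numbers are made up for illustration.

```r
set.seed(1)

# Made-up raw data standing in for the unscaled columns (assumption: two features)
raw <- matrix(rnorm(200 * 2, mean = c(50, 10), sd = c(20, 3)),
              ncol = 2, byrow = TRUE)

# Assumption: the data was standardized with scale(), which stores the
# column means and standard deviations as attributes of the result
scaled <- scale(raw)
mu    <- attr(scaled, "scaled:center")
sigma <- attr(scaled, "scaled:scale")

# Hypothetical centroids expressed in the original (unscaled) units
raw_centers <- rbind(c(30, 8), c(70, 12))

# Apply the SAME centering and scaling to the centroids as to the data
scaled_centers <- scale(raw_centers, center = mu, scale = sigma)

# The initial centers now live in the same space as the data
km <- kmeans(scaled, centers = scaled_centers)
km$size
```

If the centroids come from an earlier data set, it is worth deciding deliberately whether to reuse the old data set's means and standard deviations or the new one's; this sketch uses those of the data being clustered.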

The link above made me realise that there was a bug in my code. I hope this helps someone in a similar situation!
