
I have read k-means: Same clusters for every execution.

But it doesn't solve the problem I am having. I am sampling data that grows in size over time, and I need to cluster it using k-means, but the clusters differ from one sample to the next. The important thing to note is that my (t+1)th sample always incorporates all of the components of the tth sample, so the data slowly gets bigger and bigger. What I need is a way to have the clusters stay the same across samples. Is there a way around this other than using set.seed? I am open to any solution.

user1234440

1 Answer


The best way I can think of to accomplish this would be to cluster the data with k-means initially and then simply assign all additional data to the closest cluster (setting the random seed will not help you get the new clusters to nest within the original ones). As detailed in the answer to this question, the flexclust package makes this pretty easy:

# Split into "init" (used for initial clustering) and "later" (assigned later)
set.seed(100)
spl <- sample(nrow(iris), 0.5*nrow(iris))
init <- iris[spl,-5]
later <- iris[-spl,-5]

# Build the initial k-means clusters with "init"
library(flexclust)
(km <- kcca(init, k=3, kccaFamily("kmeans")))
# kcca object of family ‘kmeans’ 
# 
# call:
# kcca(x = init, k = 3, family = kccaFamily("kmeans"))
# 
# cluster sizes:
# 
#  1  2  3 
# 31 25 19 

# Assign each element of "later" to the closest cluster
head(predict(km, newdata=later))
#  2  5  7  9 14 18 
#  2  2  2  2  2  2 
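As a follow-up to the discussion in the comments: if you do eventually want to refresh the centroids with the accumulated data while keeping the cluster labels roughly stable, one possible approach (a sketch, not part of the original scheme above) is to re-run `stats::kmeans` seeded with the current centroids, which `flexclust::parameters` extracts from the fitted `kcca` object. Seeding with the old centers means label 1 starts at (and typically stays near) the old cluster 1, though the assignments of the original points are no longer guaranteed to be identical.

```r
# Sketch: periodically refit on all accumulated data, seeding with the
# current centroids so cluster labels stay anchored to the old centers.
library(flexclust)

set.seed(100)
spl   <- sample(nrow(iris), 0.5 * nrow(iris))
init  <- iris[spl, -5]
later <- iris[-spl, -5]

# Initial model, as in the answer above
km <- kcca(init, k = 3, kccaFamily("kmeans"))

# The t+1 sample contains everything from the tth sample plus new points
all_data <- rbind(init, later)

# parameters(km) returns the centroid matrix of the fitted kcca object;
# passing it as `centers` makes kmeans start from those centroids
km2 <- kmeans(all_data, centers = parameters(km))

table(km2$cluster)  # refreshed cluster sizes over all accumulated data
```

This keeps the model dynamic at the cost of occasionally shifting the original assignments, which is the trade-off josliber points out in the comments.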
josliber
  • It's worth noting that this is k-means followed by nearest-centroid assignment. One caveat is that if the goal is to assign new data points to clusters iteratively, it might make more sense to recompute the centroids after each added data point. It doesn't look like this does that. – Jeff Jan 18 '15 at 03:16
  • @Jeff no, this doesn't recompute centroids. However recomputing centroids and then iteratively reassigning points to the closest cluster is not what the OP wants, because the OP wants the original cluster assignment to nest within the final clusters. – josliber Jan 18 '15 at 03:22
  • @josliber Ah, what I mean is recompute the centroid means. – Jeff Jan 18 '15 at 04:39
  • This is perfect! Now it would be awesome to also see if it's possible to periodically update my initial model to incorporate the new data points, to keep it dynamic. Thanks – user1234440 Jan 18 '15 at 12:26
  • @user1234440 this is the only way I know to have the original assignments nest within the new assignments. If you are willing to change the original assignments, then why not just re-run k-means clustering? – josliber Jan 18 '15 at 18:27
  • I am essentially doing a supervised learning problem. What I am trying to do is be aware and make sure cluster 1 represents the same group of objects throughout. – user1234440 Jan 18 '15 at 21:45