
I am new to the R language. I have two data sets: a labeled "Training" data set (the Iris data set) and an unlabeled "Test" data set. I need to cluster the "Iris" data set and then use the cluster centers to assign each test case to the cluster whose center is closest.

set.seed(20)
# cluster on Petal.Length and Petal.Width (columns 3:4) with k = 3
pCluster <- kmeans(Trainingdata[, 3:4], 3, nstart = 20)
pCluster

The above code clusters the "Training" data set, but I don't know how to use the centers it produces to label the "Test" data set. Any help would be appreciated.

    Possible duplicate of [Simple approach to assigning clusters for new data after k-means clustering](http://stackoverflow.com/questions/20621250/simple-approach-to-assigning-clusters-for-new-data-after-k-means-clustering) – Has QUIT--Anony-Mousse Aug 12 '16 at 20:42

1 Answer


You can get the center values from the pCluster object like so:

pCluster$centers

This gives you the center values for Petal.Length and Petal.Width:

  Petal.Length Petal.Width
1     1.462000    0.246000
2     4.269231    1.342308
3     5.595833    2.037500

What you can do now is calculate the distance (using your chosen measure) from each test case to the centers and assign it to the closest one.

# note the lower-case "centers"; pCluster$Centers would silently return NULL
combinedMatrix <- rbind(pCluster$centers, testData[, 3:4])
dist(combinedMatrix)

This gives you a distance matrix; its first three rows/columns contain the distances of each test point to the three cluster centers. As a side note, you should normalize your input data when using k-means (at least with the most common distance measures), as otherwise features with large absolute values will overshadow features with small absolute values.
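Building on the dist() call above, a minimal sketch of the assignment step could look like this (assuming your unlabeled data lives in a data frame called testData with Petal.Length and Petal.Width in columns 3:4; that name is carried over from the code above, not verified against your actual data):

# full pairwise distance matrix; drop the first three rows (the centers themselves)
# and keep the first three columns (distances to centers 1, 2 and 3)
distToCenters <- as.matrix(dist(rbind(pCluster$centers, testData[, 3:4])))[-(1:3), 1:3]

# assign each test case to the cluster whose center is closest
testCluster <- apply(distToCenters, 1, which.min)
head(testCluster)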

However, I am not sure what it is you want to achieve. K-means is not typically used in this way, i.e. with a split into training and test data.

Is your goal to create a classifier for the test set? If so, there are better ways to achieve this. If you want to stick with the concept of distances, you can take a look at k-nearest-neighbour algorithms. If you tell us what your final goal is, I am happy to give you more pointers.
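For instance, a minimal k-nearest-neighbour sketch with the class package (assuming the training labels are the Species column of the iris data, that testData holds the same two feature columns, and that k = 5 is an arbitrary choice) might look like:

library(class)

# classify each test case by majority vote among its 5 nearest training points
predictedSpecies <- knn(train = iris[, 3:4],
                        test  = testData[, 3:4],
                        cl    = iris$Species,
                        k     = 5)
head(predictedSpecies)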