
I read the linked thread, but the answer there says to use the column means for cluster i (so with 3 clusters in total, i = 1, 2, 3; with 2 clusters, i = 1, 2). I will copy and paste the answer here:

clusters = cutree(hclust(dist(USArrests)), k=5) # get 5 clusters

# function to find the centroid (column means) of cluster i
clust.centroid = function(i, dat, clusters) {
    ind = (clusters == i)
    colMeans(dat[ind,])
}

sapply(unique(clusters), clust.centroid, USArrests, clusters)

              [,1]       [,2]   [,3]  [,4]  [,5]
Murder    11.47143   8.214286   5.59  14.2  2.95
Assault  263.50000 173.285714 112.40 336.0 62.70
UrbanPop  69.14286  70.642857  65.60  62.5 53.90
Rape      29.00000  22.842857  17.27  24.0 11.51

But that does not make sense to me! If I have a data set with 3 variables/columns and I only want 2 clusters, then with their method only the column means for columns 1 and 2 are used, and the column mean for the 3rd column will never be calculated!
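A minimal sketch for checking this concern on the quoted example. The choice of `k = 2` and the name `centers` are assumptions made for illustration; everything else is the quoted code. The idea is simply to look at the shape of the result and see which dimension corresponds to clusters and which to variables.

```r
# Same data and linkage as the quoted answer, but cut into 2 clusters
# (k = 2 is an assumption chosen to mirror the "2 clusters" case)
clusters <- cutree(hclust(dist(USArrests)), k = 2)

# Quoted helper: column means of the rows that belong to cluster i
clust.centroid <- function(i, dat, clusters) {
    ind <- (clusters == i)
    colMeans(dat[ind, ])
}

# One call per cluster label; each call returns one mean per variable
centers <- sapply(unique(clusters), clust.centroid, USArrests, clusters)

dim(centers)       # number of variables x number of clusters
rownames(centers)  # the variable names of USArrests
```

Here `i` indexes clusters, not columns: the result has one row per variable and one column per cluster, so with 2 clusters every one of the four variable means still appears, once for each cluster.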

Let's say I create the following data frame:

a = c(1,2,3,4,2,2,5,3,1)
b = c(4,5,2,2,1,1,1,1,3)
c = c(1,1,1,0,0,0,0,0,1)
abc = data.frame(a=a, b=b, c=c)
str(abc)

The last line prints the following structure:

'data.frame':   9 obs. of  3 variables:
 $ a: num  1 2 3 4 2 2 5 3 1
 $ b: num  4 5 2 2 1 1 1 1 3
 $ c: num  1 1 1 0 0 0 0 0 1

I then scale the data:

abc_scaled = scale(abc)

Then I calculate the distances, build the hierarchical clustering, and cut the tree:

distance = dist(abc_scaled, method="euclidean")
hcluster = hclust(distance, method="ward.D")
clusters = cutree(hcluster, h = (max(hcluster$height) - 0.1))

Let's say I get 2 clusters as a result. How can I compare the centroids of the 2 clusters, and how can I add the labels to the clusters?
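For concreteness, here is a rough sketch of the kind of result being asked for, not a definitive answer. It assumes the tree is cut with `k = 2` rather than by height, so that exactly 2 clusters are guaranteed. The column name `cluster` and the object name `centroids` are made up for illustration, and the centroids are computed on the original (unscaled) values.

```r
# Rebuild the example data from the question
abc <- data.frame(a = c(1, 2, 3, 4, 2, 2, 5, 3, 1),
                  b = c(4, 5, 2, 2, 1, 1, 1, 1, 3),
                  c = c(1, 1, 1, 0, 0, 0, 0, 0, 1))

# Scale, compute distances, and cluster as in the question
abc_scaled <- scale(abc)
hcluster <- hclust(dist(abc_scaled, method = "euclidean"), method = "ward.D")

# Assumption: request exactly 2 clusters instead of cutting at a height
clusters <- cutree(hcluster, k = 2)

# Add the cluster label of each observation as a new column
abc$cluster <- clusters

# Centroids: the mean of every variable within each cluster
centroids <- aggregate(. ~ cluster, data = abc, FUN = mean)
print(centroids)
```

Each row of `centroids` is one cluster and each remaining column is the mean of one variable, so the two clusters can be compared side by side.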

alwaysaskingquestions
  • But here you're using the three columns of your data.frame? This part is not clear to me: "Let's say I have a data set with 3 variables/columns with 4th column being the response var (which i won't use in clustering process), and I only want 2 clusters, using their method, I'll only use the column means for column 1 & 2 (because there's only 2 clusters), and column mean for 3rd column will never be used!" – Vincent Bonhomme Mar 28 '16 at 06:40
  • hi @VincentBonhomme. so the data set has 4 columns in total, with 4th column being the response var and so won't be included in the clustering process. and so i only have 3 columns of data in my example. does this make more sense now? – alwaysaskingquestions Mar 28 '16 at 06:48
  • But this is still unclear to me: "only the column means for column 1 & 2 are used, and column mean for 3rd column will never be calculated!" – Vincent Bonhomme Mar 28 '16 at 06:50
  • hi @VincentBonhomme, if you read the thread i have linked to you'd understand what im talking about. maybe i should copy over their answers to my post so its easier for you to understand? i can do that right now. – alwaysaskingquestions Mar 28 '16 at 06:56
  • I actually read it but still don't understand. Maybe I'm the problem ;-) – Vincent Bonhomme Mar 28 '16 at 07:00
  • hi @VincentBonhomme so i just added their answers and a bit more explanations to my post. does this clear up your confusion now? – alwaysaskingquestions Mar 28 '16 at 07:00
  • Got it now, thanks. I think [this page](http://www.stat.berkeley.edu/~s133/Cluster2a.html) is a better starting point than the thread. – Vincent Bonhomme Mar 28 '16 at 07:27
