Let's run a quick 3-cluster classification on the iris dataset with the FactoMineR package:

library(FactoMineR)
model <- HCPC(iris[,1:4], nb.clust = 3)
summary(model$data.clust$clust)

 1  2  3
50 62 38

We see that 50 observations are in cluster 1, 62 in cluster 2 and 38 in cluster 3.

Now we want to visualize these 3 clusters in a dendrogram, using the dendextend package, which makes it easy to draw nice-looking ones:

library(dendextend)
library(dplyr)
model$call$t$tree %>% 
    as.dendrogram() %>% 
    color_branches(k = 3, groupLabels = unique(model$data.clust$clust)) %>% 
    plot()

[Figure: dendrogram with the three colored, labeled clusters]

The problem is that the labels on the dendrogram don't match the true labels of the classification. Cluster 2 should be the biggest one (62 observations according to the data), but on the dendrogram it is clearly the smallest one.

I tried different things but nothing has worked so far. If you have any idea what to pass to groupLabels = so that the dendrogram shows the real labels, that would be great.

demarsylvain

1 Answer

Looking inside dendextend::color_branches, we can see that group labels are assigned using the command g <- dendextend::cutree(dend, k = k, h = h, order_clusters_as_data = FALSE).
This can be used to build a mapping between the cluster labels assigned by HCPC and the group labels assigned by dendextend::color_branches.

library(FactoMineR)
library(dendextend)
library(dplyr)
model <- HCPC(iris[,1:4], nb.clust = 3)  

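# Cluster labels assigned by HCPC, in data order
clust.hcpc <- as.numeric(model$data.clust$clust)
# Group labels as dendextend::color_branches computes them (dendrogram order)
clust.cutree <- dendextend::cutree(model$call$t$tree, k = 3, order_clusters_as_data = FALSE)
# Reorder to data order so the two vectors line up
idx <- order(as.numeric(names(clust.cutree)))
clust.cutree <- clust.cutree[idx]
# Cross-tabulate the two labelings
( tbl <- table(clust.hcpc, clust.cutree) )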

###########
          clust.cutree
clust.hcpc  1  2  3
         1 50  0  0
         2  0  0 62
         3  0 36  2

This table shows that HCPC cluster labels 2 and 3 correspond to dendextend group labels 3 and 2, respectively. (Surprisingly, two observations break this rule: they belong to HCPC cluster 3 but fall in dendextend group 3.)
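
A comment further down attributes the two stray observations to HCPC's k-means consolidation step (`consol = TRUE`, the default). A minimal sketch to check this, assuming that explanation holds, is to rerun HCPC with `consol = FALSE` and rebuild the cross-table. The names `model_nc`, `clust_hcpc_nc`, `clust_cutree_nc` and `tbl_nc` are introduced here for illustration only, and `graph = FALSE` merely suppresses the plots:

```r
library(FactoMineR)
library(dendextend)

# Same clustering as above, but without the k-means consolidation step
model_nc <- HCPC(iris[, 1:4], nb.clust = 3, consol = FALSE, graph = FALSE)

clust_hcpc_nc <- as.numeric(model_nc$data.clust$clust)
clust_cutree_nc <- dendextend::cutree(model_nc$call$t$tree, k = 3,
                                      order_clusters_as_data = FALSE)
# Put the cutree result back in data order before cross-tabulating
clust_cutree_nc <- clust_cutree_nc[order(as.numeric(names(clust_cutree_nc)))]

( tbl_nc <- table(clust_hcpc_nc, clust_cutree_nc) )
```

If consolidation is indeed the cause, each row of `tbl_nc` should contain a single nonzero cell.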

The group labels that need to be passed to dendextend::color_branches can be found as follows:

( lbls <- apply(tbl,2,which.max) )

##############
1 2 3 
1 3 2
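
To see why taking `which.max` per column yields the mapping, here is a small self-contained sketch that rebuilds the cross-table by hand (the counts are copied from the output above; `tbl_demo` and `lbls_demo` are illustrative names) and picks, for each dendextend group, the HCPC cluster that holds most of its observations:

```r
# Rows: HCPC clusters; columns: dendextend groups (counts copied from the table above)
tbl_demo <- matrix(
  c(50,  0,  0,
     0,  0, 62,
     0, 36,  2),
  nrow = 3, byrow = TRUE,
  dimnames = list(clust.hcpc = as.character(1:3),
                  clust.cutree = as.character(1:3))
)

# For each column (dendextend group), the row index of the largest count
lbls_demo <- apply(tbl_demo, 2, which.max)
lbls_demo
```

Because the two mismatched observations are a small minority, the column-wise maximum still recovers a one-to-one mapping here. If two columns pointed at the same row, the mapping would be ambiguous and the table would need to be inspected by hand.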

Here is the dendrogram:

model$call$t$tree %>%
    color_branches(k = 3, groupLabels = lbls) %>%
    set("labels_cex", .5) %>%
    plot(horiz = TRUE)

[Figure: horizontal dendrogram with the relabeled clusters]

Marco Sandri
  • Hi Marco. Thanks. If you have any suggestions on how to make dendextend better - feel free to send PR. – Tal Galili May 26 '17 at 09:57
  • Hi Marco, the trick of using `which.max` on the `table` to reorder the numbers correctly is excellent! Thanks. – demarsylvain May 29 '17 at 19:46
  • Yep. The two surprising sample units that don't follow the rule are due to the option `consol = T` in the HCPC function, which makes some modifications (cluster consolidation with k-means). – demarsylvain May 30 '17 at 15:48
  • @S.Demars Thank you for your explanation! – Marco Sandri May 30 '17 at 15:51
  • For those needing the labels to be colored as well, I had to do the following (I used res.hcpc rather than model): `( dend_to_hcpc <- apply(tbl,2,which.max) )` `( hcpc_to_dend <- apply(tbl, 1, which.max) )` `dend <- res.hcpc$call$t$tree %>% color_branches(k = res.hcpc$call$t$nb.clust, groupLabels = dend_to_hcpc, col = spectral_colors_array[dend_to_hcpc]) %>% set("labels_cex", 0.5) %>% set("branches_lwd", 3.0)` `labels_colors(dend) <- spectral_colors_array[dend_to_hcpc[clust.cutree[match(labels(dend), names(clust.cutree))]]]` – kory Sep 18 '17 at 22:41