I have used hclust to identify clusters in my data, and determine the nature of those clusters. The following is a very simplified version:

gg <- c(1,2,4,3,3,15,16)
hh <- c(1,10,3,10,10,18,16)
z <- data.frame(gg,hh)
means <- apply(z,2,mean)
sds <- apply(z,2,sd)
nor <- scale(z,center=means,scale=sds) 
d <- dist(nor, method = "euclidean")
fit <- hclust(d, method="ward.D2")
plot(fit)
rect.hclust(fit, k=3, border="red")  
groups <- cutree(fit, k=3) 
aggregate(nor,list(groups),mean)

Using aggregate I can see that the three clusters are: one with low values on both gg and hh, one with low gg and average hh, and one with high gg and high hh.

How can I see where these clusters are on the dendrogram? So far I can only tell by comparing the group sizes to the sizes shown on the dendrogram. And how can I label the cluster groups on the dendrogram (e.g. add names like "low", "med", "high" over each cluster)? I prefer answers in base R.

B.Kenobi

1 Answer

Unfortunately, without the dendextend package there is no simple built-in option for labeling clusters on the dendrogram. The closest thing is the border argument of the rect.hclust() function, which colors the rectangles... but that's no fun. Take a look at http://www.sthda.com/english/wiki/beautiful-dendrogram-visualizations-in-r-5-must-known-methods-unsupervised-machine-learning.
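
For what it's worth, a manual workaround in base R is possible if you don't mind placing the text yourself. The sketch below assumes the toy data from the question; cluster_names is a hypothetical lookup that you fill in by hand after reading the aggregate() output, and the label height is my reading of where rect.hclust() draws the top of its boxes, so treat it as an approximation rather than a documented guarantee.

```r
# Toy data and clustering from the question
gg <- c(1, 2, 4, 3, 3, 15, 16)
hh <- c(1, 10, 3, 10, 10, 18, 16)
nor <- scale(data.frame(gg, hh))
fit <- hclust(dist(nor), method = "ward.D2")
k <- 3
groups <- cutree(fit, k = k)

# Hypothetical names, indexed by cutree cluster id; choose them by hand
# after looking at aggregate(nor, list(groups), mean)
cluster_names <- c("low", "med", "high")

plot(fit)
rect.hclust(fit, k = k, border = "red")

# Leaves are drawn at x = 1, 2, ..., n in the order given by fit$order,
# so the mean leaf position within a cluster is the middle of its box
leaf_groups <- groups[fit$order]
centers <- tapply(seq_along(leaf_groups), leaf_groups, mean)

# A height between the two merges that bracket the k-cluster cut
# (assumed to match the top edge of the rect.hclust boxes)
top <- mean(rev(fit$height)[(k - 1):k])

# Write each name just above its box; xpd = NA lets text overflow the plot area
text(x = centers, y = top,
     labels = cluster_names[as.integer(names(centers))],
     pos = 3, col = "red", xpd = NA)
```

Because centers is named by cluster id, the labels stay attached to the right cluster no matter where it ends up along the dendrogram.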

In this case, with only 2 columns, I would recommend simply plotting the z data.frame and coloring the points by group. Labeling the points as well makes the plot directly comparable to the dendrogram. See this example:

# your data
gg <- c(1,2,4,3,3,15,16)
hh <- c(1,10,3,10,10,18,16)
z <- data.frame(gg,hh)

# a fun visualization function
visualize_clusters <- function(z, nclusters = 3, 
                           groupcolors = c("blue", "black", "red"), 
                           groupshapes = c(16,17,18), 
                           scaled_axes = TRUE){
  nor <- scale(z) # scale() already defaults to each column's mean and sd
  d <- dist(nor, method = "euclidean")
  fit <<- hclust(d, method = "ward.D2") # saves fit to the environment too
  groups <- cutree(fit, k = nclusters) 

  if(scaled_axes) z <- nor
  n <- nrow(z)
  plot(z, main = "Visualize Clusters",
       xlim = range(z[,1]), ylim = range(z[,2]),
       pch = groupshapes[groups], col = groupcolors[groups])
  grid(3,3, col = "darkgray") # dividing the plot into a grid of low, medium and high
  text(z[,1], z[,2], 1:n, pos = 4)

  centroids <- aggregate(z, list(groups), mean)[,-1]
  points(centroids, cex = 1, pch = 8, col = groupcolors)
  for(i in 1:nclusters){
    segments(rep(centroids[i,1],n), rep(centroids[i,2],n), 
             z[groups==i,1], z[groups==i,2], 
             col = groupcolors[i])
  }
  legend("topleft", bty = "n", legend = paste("Cluster", 1:nclusters), 
         text.col = groupcolors, cex = .8)
}

Now we can plot them together:

par(mfrow = c(2,1))
visualize_clusters(z, nclusters = 3, groupcolors = c("blue", "black", "red"))
plot(fit); rect.hclust(fit, 3, border = rev(c("blue", "black", "red")))
par(mfrow = c(1,1))

Use the grid to eyeball which clusters are low-low, low-med, and high-high.
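
One caveat: the rev() around the border colors in the rect.hclust() call above happens to match the left-to-right cluster order for this particular data. A more general way to line the colors up, sketched here under the assumption that groupcolors is indexed by cutree cluster id, is to look up which cluster each leaf belongs to in dendrogram order:

```r
# Same toy data and clustering as above
gg <- c(1, 2, 4, 3, 3, 15, 16)
hh <- c(1, 10, 3, 10, 10, 18, 16)
fit <- hclust(dist(scale(data.frame(gg, hh))), method = "ward.D2")
groups <- cutree(fit, k = 3)

groupcolors <- c("blue", "black", "red") # indexed by cutree cluster id

# Cluster ids in the order they appear along the dendrogram, left to right
left_to_right <- unique(groups[fit$order])

plot(fit)
# rect.hclust() draws its boxes left to right, so reorder the colors to match
rect.hclust(fit, k = 3, border = groupcolors[left_to_right])
legend("topright", bty = "n", legend = paste("Cluster", 1:3),
       text.col = groupcolors, cex = 0.8)
```

With the legend in place, the box colors on the dendrogram and the point colors on the scatterplot refer to the same cluster ids.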

I love line segments. Try it on larger data like:

gg <- runif(30,1,20)
hh <- c(runif(10,5,10),runif(10,10,20),runif(10,1,5))
z <- data.frame(gg,hh)
visualize_clusters(z, nclusters = 3, groupcolors = c("blue", "black", "red"))

Hope this helps a little bit.

Evan Friedland
  • Thanks. I actually have 3 variables that I'm using as cluster variables, but I appreciate your plots and may use them in the future (I was making do with simple scatterplots comparing two variables at a time, colored by group, but I like how you have illustrated the central point of each group). I've decided to just change the border colors of rect.hclust as per your recommendation and add a legend. I suppose I don't really need R to map the groups to the dendrogram, as I can figure it out from the group sizes. I just thought it would be cool. – B.Kenobi Sep 10 '18 at 01:20