3

I want to visualize how well a clustering algorithm is doing (with a certain distance metric). I have samples and their corresponding classes. To visualize this, I cluster the samples and want to color the branches of a dendrogram according to the items in each cluster: each branch should get the color of the class that most items in that hierarchical cluster belong to (as given by the data/classes).

Example: if my clustering algorithm chose indexes 1, 21 and 24 to be a certain cluster (at a certain level), and I have a CSV file containing a class number in each row, with those rows holding, let's say, 1, 2, 1, then I want this edge to be coloured 1.
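To make the rule concrete, here is a minimal base R sketch of the majority vote for that one cluster. The vector `cluster_classes` is made up for illustration and simply holds the three class numbers from the example above.

```r
## Hypothetical class labels for the items at indexes 1, 21 and 24
cluster_classes <- c(1, 2, 1)

## Count how often each class occurs and keep the most frequent one
counts <- table(cluster_classes)
majority <- names(counts)[which.max(counts)]
majority
```

The value of `majority` is the class whose color I would like the edge above these three items to get.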

Example Code:

library(cluster)
suppressPackageStartupMessages(library(dendextend))
dir <- 'distance_metrics/'
filename <- 'aligned.csv'
my.data <- read.csv(paste(dir, filename, sep=""), header = TRUE, row.names = 1)
my.dist <- as.dist(my.data)
real.clusters <- read.csv("clusters", header = TRUE, row.names = 1)
clustered <- diana(my.dist)
dend <- as.dendrogram(clustered) # convert so that it can be coloured and plotted
# dend <- colour_branches(???dend, max(real.clusters)???)
plot(dend)

EDIT: another partial example:

dir <- 'distance_metrics/' # csv in here contains a symmetric matrix
clust.dir <- "clusters/" # csv in here contains a column vector with classes
filename <- 'table.csv' # must be set before the first read.csv call
my.data <- read.csv(paste(dir, filename, sep=""), header = TRUE, row.names = 1)
my.dist <- as.dist(my.data)
real.clusters <- read.csv(paste(clust.dir, filename, sep=""), header = TRUE, row.names = 1)
clustered <- diana(my.dist)
dnd <- as.dendrogram(clustered)
borgr

3 Answers

1

Both node and edge color attributes can be set recursively on "dendrogram" objects (which are just deeply nested lists) using dendrapply. The cluster package also features an as.dendrogram method for "diana" class objects, so conversion between the object types is seamless. Using your diana clustering and borrowing some code from @Edvardoss's iris example, you can create the colored dendrogram as follows:

library(cluster)
set.seed(999)
iris2 <- iris[sample(x = 1:150,size = 50,replace = F),]
clust <- diana(iris2)
dnd <- as.dendrogram(clust)

## Duplicate rownames aren't allowed, so we need to set the "labels"
## attributes recursively. We also label inner nodes here. 
rectify_labels <- function(node, df){
  newlab <- df$Species[unlist(node, use.names = FALSE)]
  attr(node, "label") <- (newlab)
  return(node)
}
dnd <- dendrapply(dnd, rectify_labels, df = iris2)

## Create a color palette as a data.frame with one row for each spp
uniqspp <- as.character(unique(iris$Species))
colormap <- data.frame(Species = uniqspp, color = rainbow(n = length(uniqspp)))
colormap[, 2] <- c("red", "blue", "green")
colormap

## Now color the inner dendrogram edges
color_dendro <- function(node, colormap){
  if(is.leaf(node)){
    nodecol <- colormap$color[match(attr(node, "label"), colormap$Species)]
    attr(node, "nodePar") <- list(pch = NA, lab.col = nodecol)
    attr(node, "edgePar") <- list(col = nodecol)
  }else{
    spp <- attr(node, "label")
    dominantspp <- levels(spp)[which.max(tabulate(spp))]
    edgecol <- colormap$color[match(dominantspp, colormap$Species)]
    attr(node, "edgePar") <- list(col = edgecol)
  }
  return(node)
}
dnd <- dendrapply(dnd, color_dendro, colormap = colormap)

## Plot the dendrogram
plot(dnd)
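The same idea can be wrapped into a reusable function that takes any vector of class labels and builds the palette automatically. The following is only a sketch: `color_by_majority` and `paint` are hypothetical names, not functions from any package, and hclust is used instead of diana purely to keep the example free of extra packages (as.dendrogram works on either result). Ties between classes go to the first factor level.

```r
## Hypothetical helper: colour every edge by the majority class of the
## leaves beneath it. `classes` must be in the same row order as the data
## that was clustered.
color_by_majority <- function(dend, classes, palette = NULL) {
  classes <- as.factor(classes)
  if (is.null(palette)) {
    palette <- rainbow(nlevels(classes))  # one colour per class
  }
  stopifnot(length(palette) == nlevels(classes))
  names(palette) <- levels(classes)

  paint <- function(node) {
    ## Leaf values are row indices into the clustered data
    idx <- as.integer(unlist(node, use.names = FALSE))
    counts <- tabulate(as.integer(classes[idx]), nbins = nlevels(classes))
    majority <- levels(classes)[which.max(counts)]
    attr(node, "edgePar") <- list(col = palette[[majority]])
    if (is.leaf(node)) {
      attr(node, "nodePar") <- list(pch = NA, lab.col = palette[[majority]])
    }
    node
  }
  dendrapply(dend, paint)
}

## Usage on a 50-row sample of iris, clustering the numeric columns only
set.seed(999)
iris2 <- iris[sample(nrow(iris), 50), ]
hc <- hclust(dist(iris2[, 1:4]))
dnd2 <- color_by_majority(as.dendrogram(hc), iris2$Species)
plot(dnd2)
```

Because the palette comes from `rainbow(nlevels(classes))`, the number of classes does not need to be known in advance.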

Shaun Wilkinson
  • Is there a generic way of doing this without specifying the colors by name (and hence without being restricted to a predefined number of classes)? – borgr Aug 17 '17 at 07:21
  • Sure, you can just use the RGB hexadecimal format "#RRGGBB" to specify any color in the spectrum. For example, try replacing `c("red", "blue", "green")` above with `c("#B44682", "#82B446", "#4682B4")`. – Shaun Wilkinson Aug 17 '17 at 23:36
  • But this is still done by hand. Can't I use rainbow or something similar, without hard-coding the colors (as names or as RGB values)? – borgr Aug 20 '17 at 13:19
  • Yes, just replace `c("red", "blue", "green")` with `rainbow(n = length(uniqspp))` – Shaun Wilkinson Aug 20 '17 at 23:51
1

The function you are looking for is color_branches from the dendextend R package, using the arguments clusters and col. Here is an example (based on Shaun Wilkinson's example):

library(cluster)
set.seed(999)
iris2 <- iris[sample(x = 1:150,size = 50,replace = F),]
clust <- diana(iris2)
dend <- as.dendrogram(clust)

temp_col <- c("red", "blue", "green")[as.numeric(iris2$Species)]
temp_col <- temp_col[order.dendrogram(dend)]
temp_col <- factor(temp_col, unique(temp_col))

library(dendextend)
dend %>% color_branches(clusters = as.numeric(temp_col), col = levels(temp_col)) %>% 
   set("labels_colors", as.character(temp_col)) %>% 
   plot

Tal Galili
  • I thought it was the right solution, but then I tried it on real data and noticed that it stops coloring when a cluster is not made of exactly one "real cluster". How would you change the color to be that of the most represented real cluster under the edge, instead of coloring only when a single one is represented? Real-life data has noise and hence doesn't split so nicely. Or did I miss something? – borgr Aug 16 '17 at 14:47
  • Hi, could you produce a simple example to illustrate the problem? (I assume this is a bug) If so - you can post it to github.com/talgalili/dendextend/issues – Tal Galili Aug 28 '17 at 23:09
  • you can see it also in your example, some of the top branches are black I just had more black ones... – borgr Aug 29 '17 at 05:44
  • The black branches are because they do not belong to any of the sub clusters. This is a feature, not a bug. – Tal Galili Dec 03 '17 at 22:53
  • What you could do is use color_branches with k smaller than the number of clusters you wanted, and it would color the higher branches for you (but in slightly different colors) – Tal Galili Dec 03 '17 at 22:55
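As a rough sketch of the suggestion in the last comment, color_branches can be called with k instead of clusters. Note that this colors by the groups obtained from cutting the tree into k parts, not by the real classes, and that the choice of k = 2 and of the numeric iris columns is an assumption made only for illustration.

```r
library(cluster)
library(dendextend)

set.seed(999)
iris2 <- iris[sample(x = 1:150, size = 50, replace = FALSE), ]
dend <- as.dendrogram(diana(iris2[, 1:4]))

## Cut the tree into k = 2 groups (fewer than the 3 species) so that
## branches closer to the root are coloured as well
dend_k2 <- color_branches(dend, k = 2)
plot(dend_k2)
```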
0

I suspect I may have misunderstood the question, but I'll try to answer anyway. This is code from one of my earlier tasks, rewritten to use the iris example:

clrs <- rainbow(n = 3) # create palette
clrs <- clrs[iris$Species] # assign colors
plot(x = iris$Sepal.Length,y = iris$Sepal.Width,col=clrs) # simple test colors
# cluster
dt <- cbind(iris,clrs)
dt <- dt[sample(x = 1:150,size = 50,replace = F),] # create short dataset for visualization convenience
empty.labl <- gsub("."," ",dt$Species) # create a space vector with length of names intended for  reserve place to future text labels
dst <- dist(x = scale(dt[,1:4]),method = "manhattan")
hcl <- hclust(d = dst,method = "complete")
plot(hcl,hang=-1,cex=1,labels = empty.labl, xlab = NA,sub=NA)
dt <- dt[hcl$order,] # sort rows for  order objects in dendrogramm
text(x = seq(nrow(dt)), y=-.5,labels = dt$Species,srt=90,cex=.8,xpd=NA,adj=c(1,0.7),col=as.character(dt$clrs))

Edvardoss
  • Thanks for the answer, but I am looking for coloring of the clusters/edges/lines, not only of the ticks/label names. The question is how to choose the colors using the label names (e.g. color by the most frequent label color in the cluster). – borgr Aug 01 '17 at 11:27