
I have clustered a large dataset and found 6 clusters I am interested in analyzing more in depth.

I found the clusters using hclust with "ward.D" method, and I would like to know whether there is a way to get "sub-trees" from hclust/dendrogram objects.

For example

library(gplots)
library(dendextend)

data <- iris[,1:4]
distance <- dist(data, method = "euclidean", diag = FALSE, upper = FALSE)
hc <- hclust(distance, method = 'ward.D')
dnd <- as.dendrogram(hc)
plot(dnd) # to decide the number of clusters
clusters <- cutree(dnd, k = 6)

I used cutree to get the labels for each of the rows in my dataset.

I know I can get the data for each corresponding cluster (cluster 1 for example) with:

c1_data = data[clusters == 1,]

Is there any easy way to get the subtrees for each corresponding label as returned by dendextend::cutree? For example, say I am interested in getting the subtree corresponding to cluster 1.

I know I can access the branches of the dendrogram doing something like

subtree <- dnd[[1]][[2]]

but how can I get exactly the subtree corresponding to cluster 1?

I have tried

dnd[clusters == 1]

but this of course doesn't work. So how can I get the subtree based on the labels returned by cutree?

lucacerone

3 Answers


================= UPDATED answer

This can now be solved using the get_subdendrograms function from dendextend.

# needed packages:
# install.packages("gplots")
# install.packages("viridis")
# install.packages("devtools")
# devtools::install_github('talgalili/dendextend') # dendextend from github
library(dendextend)

# define dendrogram object to play with:
dend <- iris[,-5] %>% dist %>% hclust %>% as.dendrogram %>%
  set("labels_to_character") %>% color_branches(k = 5)
dend_list <- get_subdendrograms(dend, 5)

# Plotting the result
par(mfrow = c(2,3))
plot(dend, main = "Original dendrogram")
sapply(dend_list, plot)


This can also be used within a heatmap:

# plot a heatmap of only one of the sub dendrograms
par(mfrow = c(1,1))
library(gplots)
sub_dend <- dend_list[[1]] # get the sub dendrogram
# make sure of the size of the dend
nleaves(sub_dend)
length(order.dendrogram(sub_dend))
# get the subset of the data
subset_iris <- as.matrix(iris[order.dendrogram(sub_dend),-5])
# update the dendrogram's internal order so to not cause an error in heatmap.2
order.dendrogram(sub_dend) <- rank(order.dendrogram(sub_dend))
heatmap.2(subset_iris, Rowv = sub_dend, trace = "none", col = viridis::viridis(100))
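
If you want a heatmap for every sub-dendrogram rather than only the first one, the same steps can be wrapped in a loop. This is a minimal sketch building on the objects defined above (dend_list, gplots and viridis); it just repeats the subset/reorder/plot sequence for each element:

# one heatmap per sub-dendrogram (sketch)
for (i in seq_along(dend_list)) {
  sd <- dend_list[[i]]
  # subset the data to the rows belonging to this sub-dendrogram
  mat <- as.matrix(iris[order.dendrogram(sd), -5])
  # re-rank the internal order so heatmap.2 accepts the dendrogram
  order.dendrogram(sd) <- rank(order.dendrogram(sd))
  heatmap.2(mat, Rowv = sd, trace = "none", col = viridis::viridis(100))
}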


================= OLDER answer

I think what can be helpful for you are these two functions:

The first one simply iterates over all clusters and extracts each cluster's substructure. It requires:

  • the dendrogram object from which we want to get the subdendrograms
  • the clusters labels (e.g. returned by cutree)

Returns a list of subdendrograms.

extractDendrograms <- function(dendr, clusters){
    lapply(unique(clusters), function(clust.id){
        getSubDendrogram(dendr, which(clusters==clust.id))
    })
}

The second one performs a depth-first search to determine which subtree contains the cluster and, once it finds the subtree that matches the cluster exactly, returns it. Here we rely on the assumption that all elements of a cluster sit in one subtree. It requires:

  • the dendrogram object
  • positions of the elements in cluster

Returns the subdendrogram corresponding to the cluster of the given elements.

getSubDendrogram <- function(dendr, my.clust){
    # if all leaves of this subtree belong to the cluster, this is the subtree we want
    if(all(unlist(dendr) %in% my.clust))
        return(dendr)
    # otherwise descend into whichever child contains elements of the cluster
    if(any(unlist(dendr[[1]]) %in% my.clust))
        return(getSubDendrogram(dendr[[1]], my.clust))
    else
        return(getSubDendrogram(dendr[[2]], my.clust))
}
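
For a single cluster you can also call the helper directly. A minimal sketch using the dnd and clusters objects from the question (assuming, as in the question, that the cutree labels are in data order):

# subtree containing exactly the rows assigned to cluster 1
sub1 <- getSubDendrogram(dnd, which(clusters == 1))
plot(sub1)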

Using these two functions with the variables you provided in the question gives the following output. (I think the line clusters <- cutree(dnd, k = 6) should be clusters <- cutree(hc, k = 6).)

my.sub.dendrograms <- extractDendrograms(dnd, clusters)

Plotting all six elements of the list gives all the subdendrograms.
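
For instance (a quick sketch; the 2x3 layout is just a convenient choice for six clusters):

par(mfrow = c(2, 3))
sapply(my.sub.dendrograms, plot)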


EDIT

As suggested in the comment, I add a function that takes a dendrogram dend and the number of subtrees k as input, but still uses the previously defined recursive function getSubDendrogram:

prune_cutree_to_dendlist <- function(dend, k, order_clusters_as_data=FALSE) {
    clusters <- cutree(dend, k, order_clusters_as_data)
    lapply(unique(clusters), function(clust.id){    
        getSubDendrogram(dend, which(clusters==clust.id))
    })
}

A test case for 5 substructures:

library(dendextend)
dend <- iris[,-5] %>% dist %>% hclust %>% as.dendrogram %>% set("labels_to_character") %>% color_branches(k=5)

subdend.list <- prune_cutree_to_dendlist(dend, 5)

#plotting
par(mfrow = c(2,3))
plot(dend, main = "original dend")
sapply(subdend.list, plot)

I have run some benchmarks with rbenchmark against the function suggested by Tal Galili (here named prune_cutree_to_dendlist2), and the results are quite promising for the DFS approach above:

library(rbenchmark)
benchmark(prune_cutree_to_dendlist(dend, 5), 
          prune_cutree_to_dendlist2(dend, 5), replications=5)

                                test replications elapsed relative user.self
1  prune_cutree_to_dendlist(dend, 5)            5    0.02        1     0.020
2 prune_cutree_to_dendlist2(dend, 5)            5   60.82     3041    60.643


Tal Galili
storaged
  • Hi @storaged. I've tried running your code and it didn't work for me. Can you produce one function that gets a dendrogram and a k and returns a list with the k sub dendrograms? If so, we can compare it with my code. If it works (and is faster), I'd be happy to have you add it to dendextened via a pull request (https://github.com/talgalili/dendextend/issues/61 ) – Tal Galili Jan 10 '18 at 05:41
  • @TalGalili your suggestion was really interesting. For now I added a wrapper function that uses the sub functions from the first version of the post. It works for me as shown in the test case. I can try to do some time benchmarks if you are really interested? – storaged Jan 10 '18 at 14:57
  • I added also the benchmark. It seems that DFS approach works really well :) I'd be really grateful if you confirm that the code works for you now. – storaged Jan 10 '18 at 15:54
  • Hi @storaged. I finally got to have a look at your code, and it makes perfect sense to me that it works this fast compared to the solution I provided. I think this should be committed to dendextend. Do you know how to make a fork and a pull request? – Tal Galili Jan 10 '18 at 23:52
  • @TalGalili I have done fork/pull request, you can check if it is ok. I tried to provide a bit of documentation also. I am really glad that it is helpful! – storaged Jan 11 '18 at 11:56
  • @storaged Tal kindly proposed to mark your solution as the correct one. I didn't because for some reason it didn't work when I tried, but I'm really looking forward to using it from dendextend! Thank you both for your help with this! – lucacerone Jan 11 '18 at 21:06
  • Hi @lucacerone - I've now merged storaged answer/functions into dendextend, and updated this question with the solution (including the heatmap.2 addition). This function should (hopefully) be fast enough to deal with your data. Cheers – Tal Galili Jan 19 '18 at 20:12

I have now written the function prune_cutree_to_dendlist to do what you asked for. I should add it to dendextend at some point in the future.

In the meantime, here is an example of the code and output (the function is a bit slow; making it faster relies on making prune faster, which I won't get to fixing in the near future).

# install.packages("dendextend")

library(dendextend)
dend <- iris[,-5] %>% dist %>% hclust %>% as.dendrogram %>% 
  set("labels_to_character")
dend <- dend %>% color_branches(k=5)

# plot(dend)

prune_cutree_to_dendlist <- function(dend, k) {
  clusters <- cutree(dend, k, order_clusters_as_data = FALSE)
  # unique_clusters <- unique(clusters) # could also be 1:k but it would be less robust
  # k <- length(unique_clusters)
  # for(i in unique_clusters) {
  dends <- vector("list", k)
  for(i in 1:k) {
    # prune away every leaf that does not belong to cluster i
    leaves_to_prune <- labels(dend)[clusters != i]
    dends[[i]] <- prune(dend, leaves_to_prune)
  }

  class(dends) <- "dendlist"

  dends
}

prunned_dends <- prune_cutree_to_dendlist(dend, 5)
sapply(prunned_dends, nleaves)

par(mfrow = c(2,3))
plot(dend, main = "original dend")
sapply(prunned_dends, plot)


Tal Galili
  • Tal thank you so much! I wondered why it wasn't in dendextend already and was thinking to make a PR but you beat me on time! Thank you so much! I'll use your function and am really waiting for it to be integrated in dendextend! ps. is there any vignette on the dendextend internals, something like an "extend dendextend" tutorial? – lucacerone Jan 10 '18 at 08:23
  • Hi lucacerone, I think storaged solution should be marked as the correct solution. While both my solution and his solutions are correct, his solution is more tailored to the problem and will allow a MUCH faster response time. I'll add his function (with some modifications) to dendextend. I'll also edit his answer once it is done (but I think he earned his karma points more than me :) ) – Tal Galili Jan 10 '18 at 23:54
  • I would like to thank you both for helping me with this :) For some reason I wasn't able to run storaged's solution, while yours ran on the first try (but as you say it's quite slow, and in the end I decided to skip this step in my analysis): with approx. 20,000 rows to cluster I couldn't get results after 15 minutes. Thanks for the help again and for the fair play, I'll mark his solution then :) – lucacerone Jan 11 '18 at 21:01

How did you get 6 clusters using hclust? You can cut the tree at any point, so you just ask cutree to give you more clusters:

clusters = cutree(hclusters, number_of_clusters)

If you have a lot of data this may not be very handy, though. In these cases what I do is manually pick the clusters that I want to study further and then run hclust only on the data in those clusters. I don't know of any functionality in hclust that allows you to do this automatically, but it's quite easy:

good_clusters = c(which(clusters==1),
                  which(clusters==2)) # or whichever clusters you want
new_df = df[good_clusters,]
new_hclusters = hclust(dist(new_df))  # hclust needs a distance matrix
new_clusters = cutree(new_hclusters, new_number_of_clusters)
Eudald