
I have two dendrograms that I want to compare in order to find out how "similar" they are, but I don't know of any method for doing so (let alone code that implements it, say, in R).

Any leads?

UPDATE (2014-09-13):

Since asking this question, I have written an R package called dendextend for the visualization, manipulation and comparison of dendrograms. The package is on CRAN and comes with a detailed vignette. It includes functions such as cor_cophenetic, cor_bakers_gamma and Bk / Bk_plot, as well as a tanglegram function for visually comparing two trees.

Tal Galili
  • ::looks up dendrogram:: Now you have me curious. What metrics exist for such comparisons in the first place? – dmckee --- ex-moderator kitten Feb 07 '10 at 21:41
  • Are you sure you want to do this? The dendrograms are just a representation of the data. I would think that comparing (directly) the data partitioned in those two dendrograms would be more informative. – doug Feb 16 '10 at 18:51

6 Answers


Comparing dendrograms is not quite the same as comparing hierarchical clusterings, because a dendrogram carries the lengths of the branches as well as the splits, but comparing the clusterings is a good place to start. I would suggest reading E. B. Fowlkes & C. L. Mallows (1983). "A Method for Comparing Two Hierarchical Clusterings". Journal of the American Statistical Association 78 (383): 553–584 (link).

Their approach is based on cutting the trees at each level k, getting a measure Bk that compares the groupings into k clusters, and then examining the Bk vs k plots. The measure Bk is based upon looking at pairs of objects and seeing whether they fall into the same cluster or not.

I am sure that one can write code based on this method, but first we would need to know how the dendrograms are represented in R.
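
As a rough sketch of that idea, assuming both trees are `hclust` objects built on the same labelled objects: the helper `fm_bk` below is my own name (it is not from the paper or from any package), and the iris subset is only placeholder data. It cuts both trees into k clusters, cross-tabulates the memberships, and computes Bk from the pair counts.

```r
# Fowlkes-Mallows B_k for two hclust trees on the same objects (base R only).
# fm_bk is a hypothetical helper name used for illustration.
fm_bk <- function(hc1, hc2, k) {
  c1 <- cutree(hc1, k = k)
  c2 <- cutree(hc2, k = k)
  # Align memberships by object label when labels are available.
  if (!is.null(names(c1)) && !is.null(names(c2))) c2 <- c2[names(c1)]
  m  <- table(c1, c2)            # k x k matching matrix
  n  <- sum(m)
  Tk <- sum(m^2) - n             # pairs placed together in both trees (x2)
  Pk <- sum(rowSums(m)^2) - n    # pairs placed together in tree 1 (x2)
  Qk <- sum(colSums(m)^2) - n    # pairs placed together in tree 2 (x2)
  Tk / sqrt(Pk * Qk)
}

set.seed(1)
x   <- iris[sample(150, 30), -5]             # placeholder data
hc1 <- hclust(dist(x), method = "complete")
hc2 <- hclust(dist(x), method = "single")

# B_k for k = 2..10, then the B_k vs k plot described above.
bk <- sapply(2:10, function(k) fm_bk(hc1, hc2, k))
plot(2:10, bk, type = "b", xlab = "k", ylab = "Bk", ylim = c(0, 1))
```

Bk equals 1 when the two cuts give identical groupings and moves toward 0 as they disagree, so a tree compared with itself gives a flat line at 1.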

Aniko
  • That is VERY helpful Aniko - thank you! I will read further into this. – Tal Galili Feb 08 '10 at 16:37
  • Dear Aniko, Since I started this thread, I have written an R package called _dendextend_ with several functions for comparing dendrograms. Specifically: `cor_cophenetic`, `cor_bakers_gamma` and `Bk` / `Bk_plot`. The package also comes with a detailed vignette which explains these functions. – Tal Galili Sep 13 '14 at 08:48
  • A link to the vignette: http://cran.r-project.org/web/packages/dendextend/vignettes/introduction.html – Tal Galili Sep 13 '14 at 09:32

As you know, dendrograms arise from hierarchical clustering, so what you are really asking is how to compare the results of two hierarchical clustering runs. There are no standard metrics I know of, but I would look at the number of clusters found and compare membership similarity between like clusters. Here is a good overview of hierarchical clustering that my colleague wrote on clustering Scotch whiskies.
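
One simple way to look at membership overlap, sketched here with base R only: cut both trees into the same number of clusters and cross-tabulate the assignments. The data set (`USArrests`), the two linkage methods, and `k = 4` are arbitrary choices for illustration.

```r
# Two clustering runs on the same (scaled) data with different linkages.
x   <- scale(USArrests)
hc1 <- hclust(dist(x), method = "average")
hc2 <- hclust(dist(x), method = "ward.D2")

k  <- 4                                  # arbitrary number of clusters
g1 <- cutree(hc1, k = k)
g2 <- cutree(hc2, k = k)[names(g1)]      # align by object label

# Rows: clusters from run 1; columns: clusters from run 2.
tab <- table(run1 = g1, run2 = g2)
print(tab)

# For each run-1 cluster, the share of its members that land in its
# best-matching run-2 cluster.
overlap <- apply(tab, 1, max) / rowSums(tab)
print(round(overlap, 2))
```

A table with one dominant cell per row means the two runs found essentially the same groups, possibly under different cluster numbers.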

Paul

Have a look at this page:

I also asked a similar question here

It seems we can use cophenetic correlation to measure the similarity between two dendrograms. But there seems no function for this purpose in R currently.

EDIT (2014-09-18): The cophenetic function in the stats package can calculate the cophenetic dissimilarity matrix, and the correlation can then be calculated with the cor function. As @Tal has pointed out, the matrix computed from a tree converted with as.dendrogram is ordered differently, which gives wrong results if the correlation is calculated directly on the dendrogram output. This is shown in the example for the cor_cophenetic function in the dendextend package:

library(dendextend) # the package this example comes from; also provides %>%
set.seed(23235)
ss <- sample(1:150, 10)
hc1 <- iris[ss, -5] %>% dist %>% hclust("complete")
hc2 <- iris[ss, -5] %>% dist %>% hclust("single")
dend1 <- as.dendrogram(hc1)
dend2 <- as.dendrogram(hc2)
# cutree(dend1)
cophenetic(hc1)
cophenetic(hc2)
# notice how the dist matrices for the dendrograms have different orders:
cophenetic(dend1)
cophenetic(dend2)
cor(cophenetic(hc1), cophenetic(hc2)) # 0.874
cor(cophenetic(dend1), cophenetic(dend2)) # 0.16
# the difference is because the order of the distance table in the case of
# stats:::cophenetic.dendrogram will change between dendrograms!
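
If only base R is available, one possible workaround (a sketch, not necessarily what cor_cophenetic does internally) is to put both cophenetic matrices into the same label order before correlating them. This assumes the leaf labels are unique and shared by both trees.

```r
set.seed(23235)
ss  <- sample(1:150, 10)
x   <- iris[ss, -5]
hc1 <- hclust(dist(x), method = "complete")
hc2 <- hclust(dist(x), method = "single")
dend1 <- as.dendrogram(hc1)
dend2 <- as.dendrogram(hc2)

# Reorder both cophenetic matrices to one common label order.
labs <- rownames(x)
m1 <- as.matrix(cophenetic(dend1))[labs, labs]
m2 <- as.matrix(cophenetic(dend2))[labs, labs]

# Correlate the lower triangles of the aligned matrices.
r_aligned <- cor(as.vector(as.dist(m1)), as.vector(as.dist(m2)))

# Reference value computed from the hclust objects, which already share
# the original data order.
r_hclust <- cor(as.vector(cophenetic(hc1)), as.vector(cophenetic(hc2)))
```

After alignment, the dendrogram-based correlation should agree with the hclust-based one.
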
pengchy
  • Dear @pengchy - there is a function like that in R. It is the `cor_cophenetic` function, from the _dendextend_ package. – Tal Galili Sep 13 '14 at 08:44

If you have access to the underlying distance matrix that generated each dendrogram (you probably do if you generated the dendrograms in R), couldn't you just use the correlation between the corresponding values of the two matrices? I know this doesn't address the letter of what you asked, but it's a good solution to the spirit of what you asked.
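
A minimal sketch of that, assuming both distance matrices are `dist` objects over the same objects in the same order. The two matrices here are placeholders built from different columns of `iris`.

```r
# Two distance matrices over the same 20 objects (placeholder data).
d1 <- dist(iris[1:20, 1:2])   # distances from the sepal measurements
d2 <- dist(iris[1:20, 3:4])   # distances from the petal measurements

# The pairs only correspond if the objects are in the same order.
stopifnot(identical(labels(d1), labels(d2)))

# Correlation between corresponding pairwise distances.
r <- cor(as.vector(d1), as.vector(d2))
print(r)
```

Passing `method = "spearman"` to `cor` gives a rank-based version if only the ordering of the distances matters.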

dsimcha
  • Hi dsimcha, Thanks for the idea. In my particular situation, I have the distance matrix for only one of the two. So your solution is not applicable. But thanks again! – Tal Galili Feb 08 '10 at 10:26

There is a rich body of literature on tree distance metrics in the phylogenetics community that seems to have been neglected from the computer science perspective. See dist.topo in the ape package for two tree distance metrics and several citations (Penny and Hendy 1985, Kuhner and Felsenstein 1994) that consider the similarity of tree partitions, and also the Robinson-Foulds metric, which has an R implementation in the phangorn package.

One problem is that these metrics don't have a fixed scale, so they are only useful in the cases of 1) tree comparison or 2) comparison to some generated baseline, perhaps via permutation tests similar to what Tal has done with Baker's Gamma in his fantastic dendextend package.

If you have hclust or dendrogram objects generated from R hierarchical clustering, using as.phylo from the ape package will convert your dendrograms to phylogenetic trees for usage in these functions.
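
A short sketch of that route, assuming the ape package is installed. The data and linkage methods are placeholders, and the trees are unrooted first because, as I understand the dist.topo documentation, it treats trees as unrooted anyway.

```r
library(ape)

# Two hierarchical clusterings of the same 10 objects (placeholder data).
x   <- iris[seq(1, 150, by = 15), -5]
hc1 <- hclust(dist(x), method = "complete")
hc2 <- hclust(dist(x), method = "single")

# Convert to phylo objects and drop the root.
t1 <- unroot(as.phylo(hc1))
t2 <- unroot(as.phylo(hc2))

# Partition-based topological distance (Penny and Hendy 1985).
d_topo <- as.numeric(dist.topo(t1, t2, method = "PH85"))
d_self <- as.numeric(dist.topo(t1, t1, method = "PH85"))
print(c(between = d_topo, self = d_self))
```

The distance of a tree to itself is 0, and the value grows with the number of splits that the two trees do not share.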

jayelm

Take a look at this page that has lots of information about software that deals with trees, including dendrograms. I noticed several tools that deal with tree comparison, although I haven't personally used any of them yet. There are a number of references cited there also.

kc2001