
I am working in R with the dendextend package, trying to compare hclust objects with cor_cophenetic.

I have two objects that arise from clustering, clusts and clusts1, and I want to compute the cophenetic correlation between them. I have a few options, as below:

cor_cophenetic(as.phylo(clusts), as.phylo(clusts1))
[1] 0.1632751
cor_cophenetic(as.dendrogram(clusts), as.dendrogram(clusts1))
[1] 0.1632751
cor_cophenetic(clusts, clusts1)
[1] 0.689649
cor_cophenetic(as.phylo.hclust(clusts), as.phylo.hclust(clusts1))
[1] 0.1632751

I can also try a more direct approach with base R:

cor(as.vector(cophenetic(clusts)), as.vector(cophenetic(clusts1)))
[1] 0.689649

First, I don't understand the difference between calling cor_cophenetic on the hclust objects and calling it on the dendrograms or phylo objects. Is there a correct way here?

Next, I try to do a randomization test on the labels of clusts1:

per <- sample(length(clusts1$labels))
clusts1$labels <- clusts1$labels[per]

The cophenetic correlation on the dendrograms varies across randomizations (I get a distribution), but the direct cophenetic correlation on the hclust objects stays fixed at 0.689649 and does not change. Why is that?

Greenonline
erezgrn

1 Answer


The thing to remember when using cophenetic correlation is that the cophenetic distance matrices of the two trees MUST be ordered in the same way for the comparison to be meaningful. Rotating the trees or changing their data type should therefore make no difference to the value. What you are reporting is a potential bug, but I can't reproduce it. Here is an example that gives the proper results:

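library(dendextend)
# Reference tree on five points, with character labels
dend15 <- c(1:5) %>% dist %>% hclust(method = "average") %>% as.dendrogram %>% set("labels", as.character(labels(.)))
# Same shape with the labels reversed, then matched against dend15 by label
dend51 <- dend15 %>% set("labels", as.character(5:1)) %>% match_order_by_labels(dend15)
# Rotated copy of dend15 (same topology, reversed leaf order)
dend15_r <- rev(dend15)
tanglegram(dend15, dend15_r)
tanglegram(dend15, dend51)

# Compare as dendrograms
cor_cophenetic(dend15, dend15_r)
cor_cophenetic(dend15, dend51)

# Compare the same trees as hclust objects
cor_cophenetic(as.hclust(dend15), as.hclust(dend15_r))
cor_cophenetic(as.hclust(dend15), as.hclust(dend51))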

output:

> cor_cophenetic(dend15, dend15_r)
[1] 1
> cor_cophenetic(dend15, dend51)
[1] 0.3125
> cor_cophenetic(as.hclust(dend15), as.hclust(dend15_r))
[1] 1
> cor_cophenetic(as.hclust(dend15), as.hclust(dend51))
[1] 0.3125

First comparison of two trees (no topological difference, cor of 1): [tanglegram image]

Second comparison of two trees (with a topological difference, cor of 0.31): [tanglegram image]

Please create a small, self-contained example that reproduces this issue, and post it here: https://github.com/talgalili/dendextend/issues

Tal Galili
  • Hi Tal, thanks for the answer! I looked into it, and it seems that when you use cor_cophenetic on objects of type phylo or dendrogram you don't have to take care of the order, whereas when you pass objects of type hclust you have to pay more attention to the order of the labels. – erezgrn Jul 12 '17 at 11:53
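
The label-order effect described in the comment above can be sketched with base R alone. This is a minimal illustration under stated assumptions, not a description of dendextend's internals: the data are random, and naive_cor and aligned_cor are hypothetical helper names introduced here. It relies on the fact that cophenetic() on an hclust object returns distances in observation order and only attaches the labels as names, so as.vector() discards the labels and a permutation of them cannot change the result. Aligning the two matrices by label before correlating is what makes a label randomization test produce a distribution.

```r
# Toy data: 10 labelled observations (random; for illustration only)
set.seed(1)
x <- matrix(rnorm(40), nrow = 10, dimnames = list(letters[1:10], NULL))
h1 <- hclust(dist(x), method = "average")
h2 <- hclust(dist(x), method = "single")

# Hypothetical helper: correlate the raw cophenetic vectors, ignoring labels
naive_cor <- function(a, b) {
  cor(as.vector(cophenetic(a)), as.vector(cophenetic(b)))
}

# Hypothetical helper: reorder the second matrix by the first one's labels,
# then correlate the lower triangles
aligned_cor <- function(a, b) {
  m1 <- as.matrix(cophenetic(a))
  m2 <- as.matrix(cophenetic(b))
  m2 <- m2[rownames(m1), colnames(m1)]
  idx <- lower.tri(m1)
  cor(m1[idx], m2[idx])
}

# Permute the labels of the second tree, as in the question
h2_perm <- h2
h2_perm$labels <- sample(h2$labels)

before <- naive_cor(h1, h2)
after <- naive_cor(h1, h2_perm)      # same value: labels are never consulted
observed <- aligned_cor(h1, h2)      # equals `before` when labels already agree

# Label-aware null distribution for a randomization test
null_dist <- replicate(200, {
  hp <- h2
  hp$labels <- sample(h2$labels)
  aligned_cor(h1, hp)
})
p_value <- mean(null_dist >= observed)
```

Here before and after are identical, which matches the fixed 0.689649 reported in the question, while null_dist varies from draw to draw because aligned_cor matches the observations by label.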