Different results for `hclust` and `agnes` using average link

Question

I am applying a simple clustering procedure to a custom simulated similarity matrix (https://github.com/ewouddt/Files/blob/master/sim_col.RData).

However, I am noticing a difference between the hclust and agnes procedures when using average linkage. (Note: I observed the same behaviour with complete linkage as well.)

load("sim_col.RData") # A 606 x 606 similarity matrix
library(cluster)

c1 <- hclust(as.dist(1-sim_col),method="average")
c2 <- as.hclust(agnes(as.dist(1-sim_col),diss=TRUE,method="average"))

dev.new()
plot(c1)
dev.new()
plot(c2)

cut1 <- cutree(c1,k=20)
cut2 <- cutree(c2,k=20)
cut1
cut2

sort(table(cut1))
# cut1
# 10  18   9  19   3  20   4  11   7  15  17   5   6  12  16   2   8   1  13  14 
#  2   5   7   8  11  13  14  14  15  19  19  21  23  26  27  31  33  80  95 143
sort(table(cut2))
# cut2
# 18  20  19  11  17   7   8   4  12   5   9   3  10  16   2   6  14  13   1  15 
#  4   6   8   9   9  13  13  14  15  16  17  19  20  29  31  31  54  62 115 121

As expected, the dendrograms look different because hclust and agnes order the observations differently. However, cutting the trees (at k=20, for example) gives different (although similar) cluster assignments for the observations. (For example, you can see that the cluster sizes differ between the 2 results.)

Am I making a stupid mistake, or are hclust and agnes not supposed to return exactly the same result after cutting the tree? If the 2 procedures are not supposed to return the same result, where does the difference between the 2 functions lie?

score 1 · Accepted Answer · answered Mar 11 '17 at 12:28

Except for single-link, the clustering result may not be uniquely determined.

Consider the following data set:

1 2 3 4

There are three minima: merging 1 and 2, or 2 and 3, or 3 and 4. Each of these pairs is at distance 1.

Except for single-link, we will get different results depending on whether we first merge 2 and 3 or one of the other pairs.
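To make the tie concrete, here is a small hand calculation of the average-link distances for the two possible first merges. It is only a sketch: the helper `avg` is a made-up name for illustration, and it does not reproduce the tie-breaking rule that hclust or agnes actually use.

```r
# The four points from the example, treated as positions on a line
x <- c(1, 2, 3, 4)
d <- as.matrix(dist(x))

# Average-link distance between two groups of point indices
# (hypothetical helper, for illustration only)
avg <- function(a, b) mean(d[a, b])

# Option A: merge 1 and 2 first
a_12_3 <- avg(c(1, 2), 3)   # distance from {1,2} to 3
a_12_4 <- avg(c(1, 2), 4)   # distance from {1,2} to 4
a_3_4  <- avg(3, 4)         # distance from 3 to 4
c(a_12_3, a_12_4, a_3_4)    # smallest is 3-4, so the next merge gives {1,2} {3,4}

# Option B: merge 2 and 3 first
b_23_1 <- avg(c(2, 3), 1)   # distance from {2,3} to 1
b_23_4 <- avg(c(2, 3), 4)   # distance from {2,3} to 4
b_1_4  <- avg(1, 4)         # distance from 1 to 4
c(b_23_1, b_23_4, b_1_4)    # 1 and 4 are never merged next, so we get {1,2,3} {4} or {1} {2,3,4}
```

With k=2, option A yields two clusters of size 2, while option B yields a cluster of size 3 and a singleton, so the cut depends on which of the tied pairs was merged first.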

In particular, the usual algorithms cannot guarantee that they find the optimal solution. If you wanted that guarantee, the problem would likely be NP-complete. But it may also not matter much.
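If you want a rough idea of whether ties could be at play in your own data, you can count tied values in the dissimilarities. The sketch below uses the toy data from above; the variable names are made up for illustration. It only looks at the initial dissimilarities, so it will not detect ties that arise later between merged clusters, and with real-valued data you may prefer a comparison with a tolerance.

```r
# Toy data from the example above
x <- c(1, 2, 3, 4)
dv <- as.vector(dist(x))          # the 6 pairwise distances as a plain vector

# How many pairs share the smallest dissimilarity?
n_min_ties <- sum(dv == min(dv))

# Is any dissimilarity value repeated at all?
has_ties <- any(duplicated(dv))

n_min_ties
has_ties
```

For your matrix, the same check could be applied to `as.vector(as.dist(1 - sim_col))`.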
