Dist and hclust functions outputting unexpected/incorrect outputs

Question

I have been trying to use R as an alternative to MVSP for cluster analysis and PCA. However, R gives drastically different output from MVSP with every function I have found, including dist, bcdist, hclust and daisy. I have also fed the distance table produced by MVSP into R as a distance matrix, which produces yet another result. I need a cluster dendrogram using Euclidean distance, the Sorensen-Dice coefficient and average (UPGMA) clustering. A second person has reproduced the problem independently, and I have reproduced it myself on multiple computers and two versions. SAS gives the same results as MVSP.

Is there another package I could use (as an alternative to dist or hclust), or a way to view/test the algorithm in R? Is there something that could cause this, and could reverting to a significantly earlier version of R help?

Edit: found the problem. None of the functions I tried were using Sorensen's coefficient, so I used the proxy package's dist function with its built-in method="Dice".
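For readers who want to check what a Dice-based clustering should look like without installing anything, here is a minimal base-R sketch. The presence/absence matrix is hypothetical, and the hand-written formula is only an illustration, not the code that the proxy package or MVSP runs. Note that base dist() has no Dice option; its "binary" method is a Jaccard-type distance, which is one way to end up with different numbers.

```r
# Hypothetical presence/absence data: 4 sites (rows) x 5 species (columns).
x <- matrix(c(1, 1, 0, 0, 1,
              1, 1, 1, 0, 0,
              0, 0, 1, 1, 0,
              0, 1, 1, 1, 0),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("site", 1:4), paste0("sp", 1:5)))

# Sorensen-Dice similarity between rows i and j: 2a / (2a + b + c), where
# a = shared presences and b, c = presences unique to each row.
shared <- x %*% t(x)   # a for every pair of rows
totals <- rowSums(x)   # a + b (or a + c) for each row
dice_sim <- 2 * shared / outer(totals, totals, "+")

# Turn the similarity into a dissimilarity and cluster with UPGMA.
dice_dist <- as.dist(1 - dice_sim)
hc_dice <- hclust(dice_dist, method = "average")

dice_dist
```

Comparing dice_dist against the distance table exported from the other program is a quick way to see whether the two are using the same coefficient before any clustering happens.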

It wouldn't be very hard to code up a simple 'dist' function to get a distance matrix for the rows or columns of a dataset. You can then compare that to your SAS/MVSP/R outputs to try to determine where the difference lies. — Frank P., Nov 25 '13 at 20:00
You can add your own answer. It would preferably include code and example data-objects — IRTFM, Nov 30 '13 at 19:55
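A hand-rolled distance function of the kind suggested in the comment above could look like the following sketch. The three-row table and the name my_dist are made up for illustration. The one fact worth keeping in mind is that dist() measures distances between the rows of its input, so a table whose cases are in columns needs t() first.

```r
# Hypothetical numeric table: 3 observations (rows) x 2 variables (columns).
x <- rbind(a = c(0, 0), b = c(3, 4), c = c(6, 8))

# Hand-rolled Euclidean distance between all pairs of rows.
my_dist <- function(m) {
  n <- nrow(m)
  d <- matrix(0, n, n, dimnames = list(rownames(m), rownames(m)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      d[i, j] <- sqrt(sum((m[i, ] - m[j, ])^2))
    }
  }
  as.dist(d)
}

by_hand  <- my_dist(x)
built_in <- dist(x)  # default method = "euclidean", distances between ROWS

# Compare the numbers only; attributes such as the stored call may differ.
same <- isTRUE(all.equal(as.vector(by_hand), as.vector(built_in)))
same
```

If the hand-computed values match dist() but not the other program, the discrepancy is in the coefficient or the data orientation rather than in R's arithmetic.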

score 2 · Answer 1 · answered Nov 25 '13 at 21:06

Believe me, hclust() and dist() have been used a gazillion times and have also been looked at by many people. The same holds for their counterparts in the recommended package cluster, the functions agnes() and daisy(). In practice, hierarchical clustering algorithms have to decide how to order the dendrogram branches (at each split: what goes left, what goes right?). For example, agnes() and hclust() differ in their left-right assignment strategy, but otherwise they clearly coincide iff the same method is chosen. Have you carefully read the help pages? E.g. hclust() defaults to "complete", whereas agnes() defaults to the more sensible "average".
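To see how much the default linkage matters, here is a small sketch on made-up one-dimensional data (the values 1, 2, 4, 8 are an assumption chosen only to keep the arithmetic easy). The same distance matrix yields different merge heights under "complete" and "average" (UPGMA) linkage.

```r
# Toy data, hypothetical: four points on a line.
D <- dist(c(1, 2, 4, 8))

hc_complete <- hclust(D)                      # default method = "complete"
hc_average  <- hclust(D, method = "average")  # UPGMA

hc_complete$height  # 1, 3, 7
hc_average$height   # 1, 2.5, 5.666667
```

When comparing against a program that uses UPGMA, method = "average" has to be passed to hclust() explicitly.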

But if you already see "problems" with dist(), then you must not yet have mastered how to get your data into R properly, or something similar!

Let's use simple one-dimensional data (so there are no problems with dist()), namely the first 5 prime numbers, show what R does, and then you can try to prove why this should not be 100% correct:

> (D <- dist(setNames(, c(2,3,5,7,11))))
   2 3 5 7
3  1      
5  3 2    
7  5 4 2  
11 9 8 6 4
> hc <- hclust(D)
> plot(hc) ## --- see the attached image [1]

and now the same with agnes() from package cluster, carefully ensuring that the same method is used:

> library(cluster)
> ag <- agnes(D, method="complete")
> print.default(hc[1:3])
$merge
     [,1] [,2]
[1,]   -1   -2
[2,]   -3   -4
[3,]    1    2
[4,]   -5    3

$height
[1] 1 2 5 9

$order
[1] 5 1 2 3 4

>
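The transcript above creates ag but never inspects it. One possible way to finish the comparison is sketched below, assuming the cluster package is installed (it is a recommended package and normally ships with R). It converts the agnes() result with as.hclust() and compares merge heights and cophenetic distances, which do not depend on the left/right drawing order.

```r
library(cluster)

D  <- dist(setNames(nm = c(2, 3, 5, 7, 11)))
hc <- hclust(D, method = "complete")
ag <- agnes(D, method = "complete")

# Convert the agnes result to an hclust object for a like-for-like check.
ag_hc <- as.hclust(ag)

# Merge heights should agree even if branches are drawn in a different order.
heights_match <- isTRUE(all.equal(ag_hc$height, hc$height))

# Cophenetic distances describe the tree independently of the layout.
trees_match <- isTRUE(all.equal(as.vector(cophenetic(ag_hc)),
                                as.vector(cophenetic(hc))))

c(heights = heights_match, trees = trees_match)
```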

I've used the unconventional printing of the first three inner components just to illustrate what they are numerically (and if you study the plot and the output, you may start guessing what they mean ...).
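For anyone who would rather not guess, the sketch below spells out the merge component in words. The helper describe() is a hypothetical name introduced here. In an hclust object, a negative entry in merge refers to an original observation and a positive entry refers to the cluster formed at that earlier step.

```r
hc <- hclust(dist(setNames(nm = c(2, 3, 5, 7, 11))))

# Describe one side of a merge: negative = original observation,
# positive = the cluster formed at that earlier step.
describe <- function(k) {
  if (k < 0) paste("obs", hc$labels[-k]) else paste("cluster from step", k)
}

steps <- vapply(seq_len(nrow(hc$merge)), function(i) {
  sprintf("step %d: join %s and %s at height %g",
          i, describe(hc$merge[i, 1]), describe(hc$merge[i, 2]), hc$height[i])
}, character(1))

steps
hc$labels[hc$order]  # leaves in the left-to-right order used by plot(hc)
```

The height vector gives the linkage distance at each step, and order is only a plotting permutation of the leaves, so two programs can draw the same tree with branches flipped.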

Now you tell us what would not be correct here.

Hmm, the image somehow did not "work" ... well, it's not hard to get it on your screen, as you have R! — Martin Mächler, Nov 25 '13 at 21:08
I've been inputting my data as a small table of 12 rows by 5 columns; should this still work? Not sure if that could be what's throwing it off. Thank you. — Mirran, Nov 25 '13 at 23:48
