
I'm using hierarchical clustering to pull out a set number of clusters from a dataset. My objective is to test how robust the clustering solution is when I reduce the amount of data used (and potentially the variables included). I think this means subsampling the data and then building a new distance matrix and a new dendrogram each time I adjust something. One way I can think of to measure the sensitivity of the clustering solution is to compare the cluster centroids obtained from the full data with those obtained from a subset of the data. I could do this by projecting them into PCoA space and calculating the distance between cluster centroids in that space. This is close to what the betadisper function from the vegan package does (except that it calculates the distance from each point in a cluster to the cluster centroid). However, my problem is that if I create a different distance matrix for each subsample, the PCoA space will differ between subsample runs and the results will therefore not be comparable. Is it possible to simply standardise the PCoA space from different subsample runs to make them comparable?

Any pointers or alternative approaches would be greatly appreciated,

Mark

library(vegan)

# my real data has categorical variables, so I would use Gower distance there;
# this example uses Euclidean distance on the iris dataset
mydist<-dist(iris[,1:4])
# Pull out 3 clusters
hc_av<-hclust(d=mydist, method='average')
my_cut<-cutree(hc_av, 3)
# calc distance to cluster centre
mod<-betadisper(mydist, my_cut)
mod
plot(mod)

# randomly remove 5 of the 150 observations and recalculate as above - this would be bootstrapped

mydist2<-dist(iris[sort(sample(1:150, 145)),1:4])
# Pull out 3 clusters
hc_av2<-hclust(d=mydist2, method='average')
my_cut2<-cutree(hc_av2, 3)
# calc distance to cluster centre
mod2<-betadisper(mydist2, my_cut2)
mod2
par(mfrow=c(1,2))
plot(mod, main='full model'); plot(mod2, main='subset')
# How can I calculate the distance each cluster centroid has moved when
# subsampling the data, relative to the full model?
  • I suggest you have a look at this: https://cran.r-project.org/web/packages/dendextend/vignettes/introduction.html#comparing-two-dendrograms – Tal Galili Jan 23 '19 at 08:39
  • PCoA will reproduce original distances when scaled appropriately with eigenvalues (with a slight complication when you have negative eigenvalues). This means that *distances* between cluster centroids are comparable among 95% subsamples as long as you can identify "same" clusters among cluster runs. – Jari Oksanen Jan 23 '19 at 14:16
  • Thanks Tal and Jari - apologies for my slow reply. I've had a look at dendextend and can certainly see its use, but for my particular problem I couldn't figure out how I would compare dendrograms created with different amounts of data. Jari, to check I understand correctly: are you saying that I can simply calculate the difference between the centroid position of the 100% model and the centroid position of the 95% model? e.g. using the code above: `scores(mod)$centroids[1,]-scores(mod2)$centroids[1,]` – majordyRule Feb 08 '19 at 11:40
  • @Jari actually I suppose calculating difference between centroids using all PCoA axes would be preferable to just the first two e.g. `dist(rbind(mod$centroids[1,], mod2$centroids[1,]))` However, would I need to weight the calculation knowing that the first few axes explain a lot more than the later ones? – majordyRule Feb 18 '19 at 16:46
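
One possible way to sidestep the "different PCoA space per run" problem raised above is to build a single PCoA from the full data and compute the subsample centroids from those same coordinates. The following is a minimal sketch under several assumptions, not a tested recipe: it swaps in base R `cmdscale` for `betadisper`, uses Euclidean distance on iris as in the example, keeps 4 axes because there are 4 variables, and matches clusters between runs by greatest overlap. The helper name `centroids_in` and the seed are hypothetical choices made for illustration.

```r
# Sketch only: base R stand-in for the vegan workflow in the question.
set.seed(42)                       # arbitrary seed, only for reproducibility
x <- iris[, 1:4]
n_clusters <- 3

# Reference clustering and one shared PCoA space built from the full data
d_full   <- dist(x)
cut_full <- cutree(hclust(d_full, method = "average"), n_clusters)
pcoa     <- cmdscale(d_full, k = 4)   # assumption: 4 axes for 4 Euclidean variables

# Hypothetical helper: mean PCoA coordinates of each cluster
centroids_in <- function(coords, groups) {
  rowsum(coords, groups) / as.vector(table(groups))
}
cen_full <- centroids_in(pcoa, cut_full)

# One subsample run: recluster 145 of the 150 observations
idx     <- sort(sample(nrow(x), 145))
cut_sub <- cutree(hclust(dist(x[idx, ]), method = "average"), n_clusters)

# Relabel each subsample cluster as the full-data cluster it overlaps most.
# This is a simple greedy match; two subsample clusters can map to the same
# full-data cluster, in which case fewer centroids are compared.
overlap <- table(sub = cut_sub, full = cut_full[idx])
best    <- apply(overlap, 1, which.max)
relabel <- setNames(as.integer(colnames(overlap))[best], rownames(overlap))
cut_sub_matched <- unname(relabel[as.character(cut_sub)])

# Subsample centroids expressed in the full-data PCoA space
cen_sub <- centroids_in(pcoa[idx, , drop = FALSE], cut_sub_matched)

# Euclidean shift of each matched centroid across all retained axes
shared <- intersect(rownames(cen_full), rownames(cen_sub))
shift  <- sqrt(rowSums((cen_full[shared, , drop = FALSE] -
                        cen_sub[shared, , drop = FALSE])^2))
print(shift)
```

In a bootstrap, the subsample part would be wrapped in a loop and `shift` collected per run. Because `cmdscale` returns coordinates already scaled by the square roots of the eigenvalues, a plain Euclidean distance over all retained axes should not need extra weighting. With a Gower distance matrix, the negative-eigenvalue complication mentioned in the comments would still have to be handled.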
