I'm using hierarchical clustering to extract a fixed number of clusters from a dataset. My objective is to test how robust the clustering solution is when I reduce the amount of data used (and potentially the variables included). I think this means subsampling the data, then building a new distance matrix and a new dendrogram each time I adjust something.

One way I can think of to measure the sensitivity of the clustering solution is to compare the cluster centroids obtained from the full data with those obtained from a subset of the data. I could do this by projecting them into PCoA space and calculating the distance between corresponding cluster centroids in that space. This is close to what the betadisper function from the vegan package does, except that betadisper calculates the distance from each point in a cluster to that cluster's centroid.

However, my problem is that if I create a different distance matrix for each subsample, the PCoA space will differ between subsample runs, and the centroids will therefore not be comparable. Is it possible to simply standardise the PCoA space across subsample runs to make them comparable?
Any pointers or alternative approaches would be greatly appreciated,
Mark
library(vegan)
# my real data has categorical variables, so I would use Gower distance there;
# for this example I use Euclidean distance on the iris dataset
mydist<-dist(iris[,1:4])
# Pull out 3 clusters
hc_av<-hclust(d=mydist, method='average')
my_cut<-cutree(hc_av, 3)
# calc distance to cluster centre
mod<-betadisper(mydist, my_cut)
mod
plot(mod)
# randomly remove ~5% of the data (5 of 150 rows) and recalculate as above;
# in practice this would be repeated many times
mydist2<-dist(iris[sort(sample(1:150, 145)),1:4])
# Pull out 3 clusters
hc_av2<-hclust(d=mydist2, method='average')
my_cut2<-cutree(hc_av2, 3)
# calc distance to cluster centre
mod2<-betadisper(mydist2, my_cut2)
mod2
par(mfrow=c(1,2))
plot(mod, main='full model'); plot(mod2, main='subset')
# How can I calculate the distance each cluster centroid has moved when
# subsampling the data, relative to the full model?
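
To make the question concrete, here is a rough sketch of one possible workaround, not a confirmed solution. It assumes it is acceptable to compute the PCoA once on the full distance matrix (with base R's cmdscale) and to reuse those coordinates for every subsample, so that only the cluster memberships change between runs. The helper name centroids, the choice of k = 2 axes, the seed, and the nearest-centroid matching of cluster labels are all illustrative assumptions.

```r
# Sketch only: one shared PCoA space from the full data, reused for every subsample.
set.seed(1)                          # arbitrary seed, for reproducibility only
x <- iris[, 1:4]
full_dist <- dist(x)

# PCoA of the full data; k = 2 axes is an arbitrary choice for illustration
pcoa <- cmdscale(full_dist, k = 2)

# Hypothetical helper: centroid of each cluster in the shared PCoA coordinates
centroids <- function(coords, groups) {
  apply(coords, 2, function(axis) tapply(axis, groups, mean))
}

# Full-data clustering and its centroids
cl_full <- cutree(hclust(full_dist, method = "average"), 3)
cen_full <- centroids(pcoa, cl_full)

# Subsample: re-cluster from a new distance matrix,
# but locate the points using the full-data PCoA coordinates
keep <- sort(sample(nrow(x), 145))
cl_sub <- cutree(hclust(dist(x[keep, ]), method = "average"), 3)
cen_sub <- centroids(pcoa[keep, , drop = FALSE], cl_sub)

# Cluster labels are arbitrary between runs, so pair each subsample centroid
# with its nearest full-data centroid (a simple greedy match; two subsample
# clusters could end up paired with the same full-data cluster)
cross <- as.matrix(dist(rbind(cen_full, cen_sub)))[1:3, 4:6]
match_idx <- apply(cross, 2, which.min)
moved <- cross[cbind(match_idx, 1:3)]
moved                                # distance each subsample centroid has moved
```

Wrapping the subsample part in a loop would give a distribution of centroid shifts per cluster. Whether fixing the ordination to the full data is statistically appropriate here is exactly the kind of thing I would like advice on.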