I have a dataset with multiple observations for each category:
country PC1 PC2 PC3 PC4 PC5
BD 0.0960408090569664 0.373740208940467 -0.369920989335273 -1.02993010449105 -0.481901935725247
BD -0.538617581045194 0.537010643603669 0.447050616992454 -1.3888975041278 -0.759524281163431
PK -0.452943925236246 0.507244835779749 0.64679762176707 -1.38054973938184 -0.278384245105666
PK -1.01487954986928 0.737191371806965 -0.202656866687033 -1.22663700666619 0.186305912881529
UK -0.377594639422628 0.817593863033578 0.3739216019342 -1.73856626173224 1.12404906217336
UK -0.636564327570674 0.714647668634421 1.00488527275837 -1.4344227886331 0.637219423443802
US -0.775649983771687 0.0900448150403809 0.243317360780493 -1.72498526814162 -0.618714136277983
US -0.372815509141658 0.419096654055852 0.904247466040119 -0.573219421959129 -0.0154666267035251
I want to run hierarchical cluster analysis on it in R, such that there are only 4 nodes (corresponding to 4 levels of country
). The only way I can think of is to take mean values of the columns (PC1
, PC2
...) based on country
and then run hclust
in R. Since I have multiple observations for each categorical variable (there are at least 200 for each level), I want to run a bootstrap version of hierarchical cluster analysis on thousands of sub-samples (by randomly selecting one observation for each categorical variable) and running hclust
, and then get a final result. I have come across the following ways of bootstrap clustering. pvclust appears to be useful for the summarised version of this data. ClusterBootstrap and Bclust also do not look useful for my scenario. Any ideas how can I run bootstrap using sub-samples of actual observations instead of using the summarized version with /without replacement?