I have a dataset with multiple observations for each category:

country  PC1                 PC2                 PC3                 PC4                 PC5
BD       0.0960408090569664  0.373740208940467   -0.369920989335273  -1.02993010449105   -0.481901935725247
BD       -0.538617581045194  0.537010643603669   0.447050616992454   -1.3888975041278    -0.759524281163431
PK       -0.452943925236246  0.507244835779749   0.64679762176707    -1.38054973938184   -0.278384245105666
PK       -1.01487954986928   0.737191371806965   -0.202656866687033  -1.22663700666619   0.186305912881529
UK       -0.377594639422628  0.817593863033578   0.3739216019342     -1.73856626173224   1.12404906217336
UK       -0.636564327570674  0.714647668634421   1.00488527275837    -1.4344227886331    0.637219423443802
US       -0.775649983771687  0.0900448150403809  0.243317360780493   -1.72498526814162   -0.618714136277983
US       -0.372815509141658  0.419096654055852   0.904247466040119   -0.573219421959129  -0.0154666267035251

I want to run a hierarchical cluster analysis on it in R such that the tree has only 4 leaves, one for each level of country. The only way I can think of is to take the mean of each column (PC1, PC2, ...) by country and then run hclust on those means.

Since I have many observations per country (at least 200 for each level), I would like to run a bootstrap version of the analysis instead: draw thousands of sub-samples (by randomly selecting one observation for each country), run hclust on each sub-sample, and then combine the trees into a final result. I have come across the following approaches to bootstrap clustering: pvclust, ClusterBootstrap and Bclust. pvclust appears to be useful only for the summarised version of this data, and ClusterBootstrap and Bclust do not look useful for my scenario either. Any ideas on how I can run the bootstrap on sub-samples of the actual observations, with or without replacement, instead of on the summarised version?

Shakir

1 Answer

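Bootstrap cluster analysis is possible as follows: resample the observations within each country with replacement, recompute the country means, run hclust on each resample, and then use Bclust from the shipunov package to compare the bootstrapped trees with the tree from the original data: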

library(future)
library(furrr)
library(dplyr)
library(shipunov)

plan(multisession)  # run the bootstrap iterations in parallel R sessions

# simulated example data: 5 countries, 10 observations each
data <- data.frame(country = rep(c("PK", "UK", "US", "BD", "IN"), each = 10),
                   PC1 = runif(n = 50, min = -2, max = 3),
                   PC2 = runif(n = 50, min = -2.5, max = 4),
                   PC3 = runif(n = 50, min = -4, max = 2))

# per-country means, with country as row names so hclust labels the leaves
country_means <- function(df) {
  df |>
    dplyr::group_by(country) |>
    dplyr::summarise_if(is.numeric, mean) |>
    tibble::column_to_rownames(var = "country")
}

# tree from the original data, used as the reference for comparison
hc_orig <- hclust(dist(country_means(data), method = "euclidean"))

# bootstrap: resample the rows within each country with replacement,
# recompute the country means and cluster them again
n_boot <- 20000
boot_hc <- furrr::future_map(seq_len(n_boot), function(i) {
  data |>
    dplyr::group_by(country) |>
    dplyr::slice_sample(prop = 1, replace = TRUE) |>
    country_means() |>
    dist(method = "euclidean") |>
    hclust()
}, .options = furrr::furrr_options(seed = TRUE), .progress = TRUE)

# Bclust() treats the first element of hclist as the original tree
list_of_hc <- c(list(hc_orig), boot_hc)

# support for the original clusters across the bootstrapped trees
(bb3 <- Bclust(hclist = list_of_hc, relative = TRUE))
plot(bb3)

Result: [image: plot of the Bclust output]
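
If you want the variant described in the question (one randomly chosen observation per country in each iteration, rather than resampled means), a minimal base R sketch could look like the following. The toy data, the helper name one_per_country, the 200 iterations and the cut into k = 2 groups are all assumptions made for illustration, and the co-clustering frequency is just one possible way to summarise the trees, not something Bclust computes.

```r
set.seed(42)

# Toy data in the same shape as the question (values are made up)
dat <- data.frame(
  country = rep(c("BD", "PK", "UK", "US"), each = 10),
  PC1 = rnorm(40), PC2 = rnorm(40), PC3 = rnorm(40)
)

# Hypothetical helper: draw one random row per country and cluster those rows
one_per_country <- function(df) {
  idx <- vapply(split(seq_len(nrow(df)), df$country),
                function(rows) rows[sample.int(length(rows), 1)],
                integer(1))
  m <- as.matrix(df[idx, -1])
  rownames(m) <- df$country[idx]   # leaves are labelled by country
  hclust(dist(m, method = "euclidean"))
}

n_iter <- 200  # assumption: use thousands in a real analysis
hc_list <- lapply(seq_len(n_iter), function(i) one_per_country(dat))

# Share of iterations in which two countries end up in the same group
# when each tree is cut into k = 2 groups
countries <- sort(unique(dat$country))
co <- matrix(0, length(countries), length(countries),
             dimnames = list(countries, countries))
for (hc in hc_list) {
  grp <- cutree(hc, k = 2)[countries]
  co <- co + outer(grp, grp, "==")
}
co <- co / n_iter
round(co, 2)
```

A list built this way should also be usable with Bclust(hclist = ...), as long as the reference tree is placed first. Keep in mind that trees built from single observations will be much noisier than trees built from resampled means.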

Shakir