Consider this data frame:
set.seed(123)
dat1 <- data.frame(Region = rep(c("r1","r2"), each = 100),
State = rep(c("NY","MA","FL","GA"), each = 10),
Loc = rep(c("a","b","c","d","e","f","g","h"),each = 5),
ID = rep(c(1:10), each = 2),
var1 = rnorm(200),
var2 = rnorm(200),
var3 = rnorm(200),
var4 = rnorm(200),
var5 = rnorm(200),
var6 = rnorm(200))
dat1$ID <- factor(dat1$ID)
I am using the mclust package to fit a mixture model and get bootstrapped standard errors as shown here:
library(tidyverse)
library(mclust)
set.seed(123)
mod <- Mclust(dat1[,5:10], G=5, modelNames = "VEI", initialization = list(hcPairs = randomPairs(dat1[,5:10], seed = 123)))
plot(mod, what = "classification")
plot(mod, what = "density")
boot <- MclustBootstrap(mod, nboot = 25, type = "bs")#25 for now to speed up
Next I want to create a new data frame that shows the mixture probabilities for each observation, so I combine the original dat1
with mod$z
.
probs <- mod$z
colnames(probs) <- paste0("Prob", 1:mod$G)
probs <- cbind(dat1, probs)
probs <- cbind(probs, cluster = mod$classification)
Now I want collect:
1) the average value of each variable within each cluster, and 2) the standard error for each of those averages (which comes from summary(boot, what = "SE")$mean
). I will use these to create the plot below, and return a table that shows the average and SE values.
newdat <- cbind(dat1, cluster = mod$classification)
a <-
newdat%>%
dplyr::select_if(is.numeric)%>%
dplyr::group_by(cluster)%>%
summarise_all(mean)%>%
pivot_longer(-c("cluster"), names_to = "Vars", values_to = "mean")
b<-
as.data.frame(cbind(cluster=1:mod$G, t(summary(boot, what = "se")$mean)))%>%
pivot_longer(., -c("cluster"), names_to = "Vars", values_to = "SE")
a<-dplyr::mutate(a, "SE" = b$SE)
ggplot(a, aes(x=Vars, y=mean, group=factor(cluster))) +
geom_line(aes(colour=factor(cluster)))+
geom_point()+
geom_errorbar(aes(ymin=mean-SE, ymax=mean+SE), width = .1)+
ggtitle("Mean by cluster") +
expand_limits(y=0) +
scale_y_continuous(breaks=0:20*4) +
labs(x = "Var", y = "Cluster Average")+
theme_bw() +
theme(legend.justification=c(1,0),
legend.position=c(1,0))
library(knitr)
kable(a)
Ultimately I want to write a function that will perform each of these steps and return the output at once. It will be applied to data frames structured just like dat1
, except they will have different numbers of var
columns. I will also need to specify how many clusters to use and which model to use when mod
is created. What is a better way for me to collect the information that is stored in a
so that the same process can be applied to any combination of variables (var.
columns) and mixture components (G
specified in mod
) from within a function?