I am trying to use kfold CV as a means of evaluating a model run using brms and I feel like I'm missing something. As a reproducible example, my data are structured as a binary response (0, 1) dependent on the length of an individual. Here is some code to generate and plot data similar to those I am working with:
library(brms)
library(tidyverse)
library(loo)
length <- seq(0, 100, by = 1)
n_fish_per_length <- 10
a0 <- -48
a1 <- 2
a2 <- -0.02
prob <- plogis(a0 + a1 * length + a2 * length^2)
plot(length, prob , type = 'l')
sim_data <-
expand_grid(fish_id = seq_len(n_fish_per_length),
length = length) %>%
mutate(prob_use = plogis(a0 + a1 * length + a2 * length^2)) %>%
mutate(is_carp = rbinom(n = n(), size = 1, prob= prob_use))
ggplot(sim_data, aes(x = length, y = is_carp)) +
geom_jitter(width = 0, height = 0.05) +
geom_smooth(method = "glm", formula = y ~ x + I(x^2),
method.args = list(family = binomial(link = "logit")))
I then use brms to run my model.
Bayes_Model_Binary <- brm(formula = is_carp ~ length + I(length^2),
data=sim_data,
family = bernoulli(link = "logit"),
warmup = 2500,
iter = 5000,
chains = 4,
inits= "0",
cores=4,
seed = 123)
summary(Bayes_Model_Binary)
I'd like to use kfold CV to evaluate the model. I can use something like this:
kfold(Bayes_Model_Binary, K = 10, chains = 1, save_fits = T)
but the response in my data is highly imbalanced (~18% = 1, ~82% = 0) and my reading suggests that I need to used stratified kfold cv to account for this. If I use:
sim_data$fold <- kfold_split_stratified(K = 10, x = sim_data$is_carp)
the data are split the way I would expect but I'm not sure what the best way is to move forward with the CV process from here. I saw this post https://mc-stan.org/loo/articles/loo2-elpd.html, but I'm not sure how to modify this to work with a brmsfit object. Alternatively, it appears that I should be able to use:
kfold(Bayes_Model_Binary, K = 10, folds = 'stratified', group = sim_data$is_carp)
but this throws an error. Likely because is_carp is the response rather than a predictor in the model. What would my group be in this context? Am I missing/misinterpreting something here? I'm assuming that there is a very simple solution here that I am overlooking but appreciate any thoughts.