I have height data (numeric height data in cm; Height) of plants measured over time (numeric data expressed in days of the year; Doy). These data is grouped per genotype (factor data; Genotype) and individual plant (Factor data; Individual). I've managed to calculate the RMSE of the location-scale GAM but I can't figure out how to bootstrap the uncertainty estimate on the RMSE calculation given it is a hierarchical location-scale generalized additive model.
The code to extract the RMSE value looks something like this:
# The GAM
model <- gam(list(Height ~ s(Doy, bs = 'ps', by = Genotype) +
s(Doy, Individual, bs = "re") +
Genotype,
~ s(Doy, bs = 'ps', by = Genotype) +
s(Doy, Individual, bs = "re") +
Genotype),
family = gaulss(), # Gaussian location-scale
method = "REML",
data = data)
# Extract the model formula
form <- formula.gam(model)
# Cross-validation for the location
CV <- CVgam(form[[1]], data, nfold = 10, debug.level = 0, method = "GCV.Cp",
printit = TRUE, cvparts = NULL, gamma = 1, seed = 29)
# The root mean square error is given by taking the square root of the MSE
sqrt(CV$cvscale[1])`
There is only one height measurement per Individual per day of the year. I figure this is problematic in maintaining the exact same formulation of the GAM. In thsi regard, I was thinking of making sure that the same few Individuals of each genotype (let's say n = 4) were randomly sampled over each day of the year. I can't figure out how to proceed though. Any ideas?
I've tried several methods, such as the boot package and for loops. An example of one of things I've tried is:
lm=list();counter=0
lm2=list()
loops = 3
for (i in 1:loops){
datax <- data %>%
group_by(Doy, Genotype) %>%
slice_sample(prop = 0.6, replace = T)
datax
model <- gam(list(Height ~ s(Doy, bs = 'ps', by = Genotype) +
s(Doy, Individual, bs = "re") +
Genotype,
~ s(Doy, bs = 'ps', by = Genotype) +
s(Doy, Individual, bs = "re") +
Genotype),
family = gaulss(),
method = "REML",
data = datax)
# Extract the model formula
form <- formula.gam(model)
# Cross-validation for the location
CV <- CVgam(form[[1]], datax, nfold = 10, debug.level = 0, method = "GCV.Cp",
printit = TRUE, cvparts = NULL, gamma = 1, seed = 29)
RMSE[i] <- sqrt(CV$cvscale[c(1)])
}
RMSE
This loop runs very slow and just returns me 3 times the same RMSE values; Surely, there is an issue with the sampling.
Unfortunately, I can't share my data but maybe somebody has an idea on how to proceed?
Many thanks!