0

I have height data (numeric height data in cm; Height) of plants measured over time (numeric data expressed in days of the year; Doy). These data is grouped per genotype (factor data; Genotype) and individual plant (Factor data; Individual). I've managed to calculate the RMSE of the location-scale GAM but I can't figure out how to bootstrap the uncertainty estimate on the RMSE calculation given it is a hierarchical location-scale generalized additive model.

The code to extract the RMSE value looks something like this:


# The GAM
model <- gam(list(Height ~ s(Doy, bs = 'ps', by = Genotype) +
                  s(Doy, Individual, bs = "re") +
            Genotype, 
            ~ s(Doy, bs = 'ps', by = Genotype) +
              s(Doy, Individual, bs = "re") +
            Genotype),
            family = gaulss(), # Gaussian location-scale
            method = "REML",
            data = data)

# Extract the model formula
form <- formula.gam(model)

# Cross-validation for the location
CV <- CVgam(form[[1]], data, nfold = 10, debug.level = 0, method = "GCV.Cp",
              printit = TRUE, cvparts = NULL, gamma = 1, seed = 29)

# The root mean square error is given by taking the square root of the MSE
sqrt(CV$cvscale[1])`

There is only one height measurement per Individual per day of the year. I figure this is problematic in maintaining the exact same formulation of the GAM. In thsi regard, I was thinking of making sure that the same few Individuals of each genotype (let's say n = 4) were randomly sampled over each day of the year. I can't figure out how to proceed though. Any ideas?

I've tried several methods, such as the boot package and for loops. An example of one of things I've tried is:

lm=list();counter=0
lm2=list()

loops = 3

for (i in 1:loops){
datax <- data %>% 
  group_by(Doy, Genotype) %>%
  slice_sample(prop = 0.6, replace = T)
  datax
  
  model <- gam(list(Height ~ s(Doy, bs = 'ps', by = Genotype) +
                  s(Doy, Individual, bs = "re") +
            Genotype, 
            ~ s(Doy, bs = 'ps', by = Genotype) +
              s(Doy, Individual, bs = "re") +
            Genotype),
            family = gaulss(),
            method = "REML",
            data = datax)

  # Extract the model formula
  form <- formula.gam(model)

# Cross-validation for the location
  CV <- CVgam(form[[1]], datax, nfold = 10, debug.level = 0, method = "GCV.Cp",
              printit = TRUE, cvparts = NULL, gamma = 1, seed = 29)

RMSE[i] <- sqrt(CV$cvscale[c(1)])
}

RMSE

This loop runs very slow and just returns me 3 times the same RMSE values; Surely, there is an issue with the sampling.

Unfortunately, I can't share my data but maybe somebody has an idea on how to proceed?

Many thanks!

  • Your CV isn't really testing the k-fold CV of your gaussian GAMLSS model; it's just testing a location-only model, which could well go wrong if the heterogeneity in the data (modelled by the variance linear predictor in your original model) ends up affecting the estimated smooths in the location-only model. Non-parametric bootstrapping can be problematic for GAMs; the repeated data observations that arise from sampling with replacement can affect the smoothing parameters in a way that the original data won't/doesn't. You'd be better off doing posterior simulation. – Gavin Simpson Feb 18 '23 at 13:38
  • Thanks for the suggestion! I didn't realize that posterior simulation should also be applied to GAM model validation. It seems posterior simulation and using a loss function might solve my problem. May I assume that the gamlssCV function for the gamlss framework does perform k-fold cross validation on a gamlss model without using posterior simulation? – Bertold Mariën Feb 18 '23 at 14:24

0 Answers0