I'm running LightGBM (via lgb.cv in R) with 5-fold cross-validation to predict each of the first 123 PCs from a plasma metabolite principal component analysis. I'd like the R-squared at the best iteration for each outcome, but I can't find a direct way to extract it.
As a work-around, I calculated R-squared as 1 minus the mean squared error divided by the variance of y, but I'm not sure whether this is appropriate. Does the MSE reflect the model fit on the held-out folds only, or on the entire dataset? If it's only the held-out folds, then a variance of y calculated on the entire dataset seems inappropriate to use. Thanks!
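To make the work-around concrete, here's a toy illustration of the calculation I'm applying to each outcome (y and pred are just made-up numbers, not my real data):

# Toy example of the R-squared formula used below
set.seed(1)
y <- rnorm(100)
pred <- y + rnorm(100, sd = 0.5)   # stand-in for model predictions
mse <- mean((y - pred)^2)          # what I take from lgb.cv's best_score (metric = "l2")
r_sq <- 1 - mse / var(y)           # R-squared = 1 - MSE / variance of y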
##############
# Loop for 5-fold cross-validation of plasma metabolite PC LightGBM models;
# extract the cross-validated MSE to calculate R-squared for each outcome.
# Overall model

library(lightgbm)

# Create the needed datasets and result vectors
# Predictors: all columns after the 123 PC outcome columns
dat_x <- as.matrix(plasma_pca_pred[, 124:ncol(plasma_pca_pred)])
mse.cv <- numeric(123)
r_sq.cv <- numeric(123)

# Run lgb.cv once for each included plasma PC
for (i in 1:123) {
  # Outcome: the i-th plasma PC; predictors: all metabolite columns
  dat_y.i <- as.numeric(plasma_pca_pred[, i])
  dat.i <- lgb.Dataset(dat_x, label = dat_y.i)

  light_gbn.cv <- lgb.cv(
    params = list(
      objective = "regression",
      metric = "l2",
      max_depth = 5,
      num_leaves = 25,
      min_data_in_leaf = 15,
      num_iterations = 200,
      early_stopping_rounds = 100,
      learning_rate = 0.005,
      feature_fraction = 0.2,
      bagging_fraction = 0.8,
      bagging_freq = 1,
      num_threads = 2,
      verbosity = -1
    ),
    data = dat.i,
    nfold = 5L
  )

  mse.cv[i] <- light_gbn.cv$best_score          # l2 (MSE) at the best iteration
  r_sq.cv[i] <- 1 - mse.cv[i] / var(dat_y.i)    # R-squared work-around
}
r_sq.cv