I have noticed a peculiar effect: after shuffling the data with the sample function, the testing-set RMSE reported via the caret package comes out lower than the training-set RMSE.
My code performs a common training/testing split:
set.seed(seed)
training.index <- createDataPartition(dataset[[target_label]], p = 0.8, list = FALSE)
training.set <- dataset[training.index, ]
testing.set <- dataset[-training.index, ]
This gives, for example, a testing-set RMSE of 0.651, which is higher than the training-set RMSE of 0.575 - as expected.
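For reference, this is roughly how such RMSE values can be computed (a sketch; model stands for any fitted caret model here, and the training RMSE could equally be read from the cross-validated results in model$results):
# sketch: RMSE on both sets for a fitted caret model `model` (hypothetical name)
predicted.training <- predict(model, newdata = training.set)
predicted.testing <- predict(model, newdata = testing.set)
RMSE.training <- sqrt(mean((training.set[[target_label]] - predicted.training)^2))
RMSE.testing <- sqrt(mean((testing.set[[target_label]] - predicted.testing)^2))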
Following the recommendation of many sources, e.g. here, the data should be shuffled, so I do it before the above split:
# shuffle data - short version:
set.seed(17)
dataset <- data %>% nrow %>% sample %>% data[.,]
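For readability, the same shuffle can be written without the pipe (an equivalent sketch, since sample(n) returns a permutation of 1:n when replace = FALSE):
# equivalent base-R form of the shuffle above
set.seed(17)
dataset <- data[sample(nrow(data)), ]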
After this shuffle, the testing-set RMSE (0.528) becomes lower than the training-set RMSE (0.575)! This finding is consistent across a number of algorithms, including lm, glm, knn, kknn, rf, gbm, svmLinear, svmRadial, etc.
To my knowledge, the default of sample() is replace = FALSE, so there cannot be any data leakage into the testing set. The same observation occurs in classification (for accuracy and kappa), even though createDataPartition() performs stratification, so any class imbalance should be handled.
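To rule out literal row leakage, the two sets can also be checked for overlap directly (a sketch using the objects from the split above; the duplicate check assumes the raw data itself has no duplicated rows):
# indices of the two sets are disjoint by construction
testing.index <- setdiff(seq_len(nrow(dataset)), training.index)
length(intersect(training.index, testing.index))   # expect 0
# no row should appear in both sets
sum(duplicated(rbind(training.set, testing.set)))  # expect 0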
I don't use any extraordinary configuration, just ordinary cross-validation:
training.configuration <- trainControl(
method = "repeatedcv", number = 10
, repeats = CV.REPEATS
, savePredictions = "final",
# , returnResamp = "all"
)
What did I miss here?
--
Update 1: Hunch about data leakage into testing set
I checked the data distribution and found a potential hint for the described effect.
Training set distribution:
. Freq prop
1 1 124 13.581599
2 2 581 63.636364
3 3 194 21.248631
4 4 14 1.533406
Testing set distribution without shuffle:
. Freq prop
1 1 42 18.502203
2 2 134 59.030837
3 3 45 19.823789
4 4 6 2.643172
Testing set distribution with shuffle:
. Freq prop
1 1 37 16.299559
2 2 139 61.233480
3 3 45 19.823789
4 4 6 2.643172
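For reference, tables like the ones above can be produced along these lines (a sketch; distribution_of is just an illustrative helper and assumes the target referenced by target_label is categorical):
# frequency and percentage of each target level
distribution_of <- function(data) {
  tab <- as.data.frame(table(data[[target_label]]))
  tab$prop <- 100 * tab$Freq / sum(tab$Freq)
  tab
}
distribution_of(training.set)
distribution_of(testing.set)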
If we look at the mode (the most frequent value), its proportion in the testing set with shuffle (61.2%) is closer to the training-set proportion (63.6%) than without shuffle (59.0%).
I don't know how to interpret this statistically in terms of underlying theory - can anybody?
My intuition is that shuffling makes the stratification of the testing-set distribution (implicitly performed by createDataPartition()) "more stratified" - by which I mean "closer to the training-set distribution". This may cause a kind of leakage in the opposite direction - into the testing set.
Update 2: Reproducible Code
library(caret)
library(tidyverse)
library(magrittr)
library(mlbench)
data(BostonHousing)
seed <- 171
# shuffled <- TRUE
shuffled <- FALSE
if (shuffled) {
dataset <- BostonHousing %>% nrow %>% sample %>% BostonHousing[., ]
} else {
dataset <- BostonHousing %>% as_tibble()
}
target_label <- "medv"
features_labels <- dataset %>% select_if(is.numeric) %>%
select(-target_label) %>% names %T>% print
# define ml algorithms to train
algorithm_list <- c(
"lm"
, "glmnet"
, "knn"
, "gbm"
, "rf"
)
# repeated cv
training_configuration <- trainControl(
method = "repeatedcv", number = 10
, repeats = 10
, savePredictions = "final",
# , returnResamp = "all"
)
# preprocess by standardization within each k-fold
preprocess_configuration = c("center", "scale")
# select variables
dataset %<>% select(target_label, features_labels) %>% na.omit
# dataset subsetting for tibble: [[
set.seed(seed)
training.index <- createDataPartition(dataset[[target_label]], p = 0.8, list = FALSE)
training.set <- dataset[training.index, ]
testing.set <- dataset[-training.index, ]
########################################
# 3.2: Select the target & features
########################################
target <- training.set[[target_label]]
features <- training.set %>% select(features_labels) %>% as.data.frame
########################################
# 3.3: Train the models
########################################
models.list <- list()
models.list <- algorithm_list %>%
map(function(algorithm_label) {
model <- train(
x = features,
y = target,
method = algorithm_label,
preProcess = preprocess_configuration,
trControl = training_configuration
)
return(model)
}
) %>%
setNames(algorithm_list)
UPDATE: Code to calculate testing-set performance
observed <- testing.set[[target_label]]
models.list %>%
predict(testing.set) %>%
map_df(function(predicted) {
sqrt(mean((observed - predicted)^2))
}) %>%
t %>% as_tibble(rownames = "model") %>%
rename(RMSE.testing = V1) %>%
arrange(RMSE.testing) %>%
as.data.frame
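For comparison with the training-set RMSE quoted earlier, the same models can be evaluated on training.set (a sketch using map_dbl and enframe; note this is the resubstitution RMSE, while the cross-validated RMSE sits in each model's $results):
observed.training <- training.set[[target_label]]
models.list %>%
  map_dbl(function(model) {
    predicted <- predict(model, newdata = training.set)
    sqrt(mean((observed.training - predicted)^2))
  }) %>%
  enframe(name = "model", value = "RMSE.training") %>%
  arrange(RMSE.training) %>%
  as.data.frame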
Running this code both with shuffled = FALSE and with shuffled = TRUE, and evaluating on testing.set, gives:
model RMSE.testing RMSE.testing.shuffled
1 gbm 3.436164 2.355525
2 glmnet 4.516441 3.785895
3 knn 3.175147 3.340218
4 lm 4.501077 3.843405
5 rf 3.366466 2.092024
The effect is reproducible!
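To check whether the gap is systematic or just a property of this particular 80/20 split, the partition can be repeated over several seeds (a sketch for a single fast model; the seed values and the lighter 5-fold CV are arbitrary choices for illustration):
# testing-set RMSE of lm across several different partition seeds
rmse_for_seed <- function(seed) {
  set.seed(seed)
  idx <- createDataPartition(dataset[[target_label]], p = 0.8, list = FALSE)
  train.set <- dataset[idx, ]
  test.set <- dataset[-idx, ]
  fit <- train(
    x = train.set %>% select(features_labels) %>% as.data.frame,
    y = train.set[[target_label]],
    method = "lm",
    preProcess = preprocess_configuration,
    trControl = trainControl(method = "cv", number = 5)
  )
  sqrt(mean((test.set[[target_label]] - predict(fit, newdata = test.set))^2))
}
sapply(c(17, 42, 171, 1234, 2020), rmse_for_seed)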