
I have noticed a peculiar effect: the RMSE for the testing set gets lower than that of the training set when I shuffle the data with the sample function before splitting it with the caret package.

My code does a standard split into training and testing sets:

  set.seed(seed)
  training.index <- createDataPartition(dataset[[target_label]], p = 0.8, list = FALSE)
  training.set <- dataset[training.index, ]
  testing.set <- dataset[-training.index, ]

This gives, for example, a testing set RMSE of 0.651, which is higher than the training set RMSE of 0.575 - as expected.
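
For clarity, the training RMSE here is the cross-validated RMSE reported by caret, and the testing RMSE is computed on the hold-out set - roughly as in this sketch (the object names follow the full pipeline in Update 2 below):

  # sketch of the comparison (full pipeline in Update 2 below)
  model <- train(
    x = features, y = target, method = "lm",
    preProcess = c("center", "scale"),
    trControl = training.configuration
  )
  getTrainPerf(model)$TrainRMSE                                   # training RMSE, e.g. 0.575
  RMSE(predict(model, testing.set), testing.set[[target_label]])  # testing RMSE, e.g. 0.651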

Following the recommendation of many sources (e.g. here) that the data should be shuffled, I do it before the above split:

  # shuffle data - short version:
  set.seed(17)
  dataset <- data %>% nrow %>% sample %>% data[.,]

After this shuffle, the testing set RMSE (0.528) becomes lower than the training set RMSE (0.575)! This finding is consistent across a number of algorithms, including lm, glm, knn, kknn, rf, gbm, svmLinear, svmRadial, etc.

As far as I know, the default of sample() is replace = FALSE, so there can't be any data leakage into the testing set. The same observation occurs in classification (for accuracy and kappa), even though createDataPartition performs stratification, so any class imbalance should be handled.
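
To rule out an indexing mistake on my side, here is a quick sanity check along those lines (a sketch using the objects from the split above):

  # sanity check: sample() without replacement returns a permutation, and the
  # training/testing indices produced by createDataPartition() are disjoint
  n <- nrow(dataset)
  stopifnot(anyDuplicated(sample(n)) == 0)   # every row index appears exactly once
  testing.index <- setdiff(seq_len(n), training.index)
  stopifnot(length(intersect(training.index, testing.index)) == 0)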

I don't use any extraordinary configuration, just ordinary cross-validation:

  training.configuration <- trainControl(
    method = "repeatedcv", number = 10
    , repeats = CV.REPEATS
    , savePredictions = "final"
    # , returnResamp = "all"
    )

What did I miss here?

--

Update 1: Hunch about data leakage into the testing set

I checked the data distribution and found a potential hint for the described effect.

Training set distribution:

  . Freq      prop
1 1  124 13.581599
2 2  581 63.636364
3 3  194 21.248631
4 4   14  1.533406

Testing set distribution without shuffle:

  . Freq      prop
1 1   42 18.502203
2 2  134 59.030837
3 3   45 19.823789
4 4    6  2.643172

Testing set distribution with shuffle:

  . Freq      prop
1 1   37 16.299559
2 2  139 61.233480
3 3   45 19.823789
4 4    6  2.643172
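
(The tables above were produced with an ad-hoc helper along these lines; `distribution_table` is only for illustration, not part of the pipeline:)

  # ad-hoc helper to tabulate the target distribution (target treated as discrete)
  distribution_table <- function(x) {
    tab <- table(x)
    data.frame(. = names(tab),
               Freq = as.vector(tab),
               prop = 100 * as.vector(prop.table(tab)))
  }
  distribution_table(training.set[[target_label]])
  distribution_table(testing.set[[target_label]])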

If we look at the mode (the most frequent value), its proportion in the testing set with shuffle (61.2%) is closer to the training set proportion (63.6%) than without shuffle (59.0%).

I don't know how to interpret this statistically in terms of the underlying theory - can anybody?

My intuition is that the shuffling makes the stratification of the testing set distribution (implicitly performed by createDataPartition()) "more stratified" - by that I mean "closer to the training set distribution". This may cause a kind of leakage in the opposite direction - into the testing set.
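
One crude way to put a number on "closer to the training set distribution" - purely my own ad-hoc illustration, using the rounded proportions from the tables above - is the total variation distance between the class proportions:

  # total variation distance between class proportions (ad-hoc illustration)
  tv_distance <- function(p, q) sum(abs(p - q)) / 2

  p_train      <- c(13.58, 63.64, 21.25, 1.53) / 100
  p_test_plain <- c(18.50, 59.03, 19.82, 2.64) / 100
  p_test_shuf  <- c(16.30, 61.23, 19.82, 2.64) / 100

  tv_distance(p_train, p_test_plain)  # ~0.060
  tv_distance(p_train, p_test_shuf)   # ~0.038 - the shuffled testing set sits closer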

Update 2: Reproducible Code

library(caret)
library(tidyverse)
library(magrittr)
library(mlbench)

data(BostonHousing)

seed <- 171

# shuffled <- TRUE
shuffled <- FALSE

if (shuffled) {
  dataset <- BostonHousing %>% nrow %>% sample %>% BostonHousing[., ]
} else {
  dataset <- BostonHousing %>% as_tibble()
}

target_label <- "medv"
features_labels <- dataset  %>% select_if(is.numeric) %>%
  select(-target_label) %>% names %T>% print

# define ml algorithms to train
algorithm_list <- c(
  "lm"
  , "glmnet"
  , "knn"
  , "gbm"
  , "rf"
)

# repeated cv
training_configuration <- trainControl(
  method = "repeatedcv", number = 10
  , repeats = 10
  , savePredictions = "final"
  # , returnResamp = "all"
)

# preprocess by standardization within each k-fold
preprocess_configuration = c("center", "scale")

# select variables
dataset %<>% select(target_label, features_labels) %>% na.omit

# dataset subsetting for tibble: [[
set.seed(seed)
training.index <- createDataPartition(dataset[[target_label]], p = 0.8, list = FALSE)
training.set <- dataset[training.index, ]
testing.set <- dataset[-training.index, ]

########################################
# 3.2: Select the target & features
########################################
target <- training.set[[target_label]]
features <- training.set %>% select(features_labels) %>% as.data.frame

########################################
# 3.3: Train the models
########################################
models.list <- list()

models.list <- algorithm_list %>%

  map(function(algorithm_label) {
    model <- train(
      x = features,
      y = target,
      method = algorithm_label,
      preProcess = preprocess_configuration,
      trControl = training_configuration
    )
    return(model)
    }
  ) %>%
  setNames(algorithm_list)

Update 3: Code to calculate testing set performance

# observed target values of the hold-out testing set
observed <- testing.set[[target_label]]

# predict with every trained model and compute the testing RMSE per model
models.list %>%
  predict(testing.set) %>%
  map_df(function(predicted) {
    sqrt(mean((observed - predicted)^2))
  }) %>%
  t %>% as_tibble(rownames = "model") %>%
  rename(RMSE.testing = V1) %>%
  arrange(RMSE.testing) %>%
  as.data.frame

Running this code on testing.set for both shuffled = FALSE and shuffled = TRUE gives:

   model RMSE.testing RMSE.testing.shuffled
1    gbm       3.436164       2.355525
2 glmnet       4.516441       3.785895
3    knn       3.175147       3.340218
4     lm       4.501077       3.843405
5     rf       3.366466       2.092024

The effect is reproducible!

Agile Bean
  • What value do you use for `split_ratio`? – Vitali Avagyan Sep 28 '19 at 15:01
  • I use `split_ratio = .8`, but it doesn't really matter for the described effect. Updated the description. – Agile Bean Sep 28 '19 at 15:17
  • No, I did mean `sample`, and yes, changing the order was the goal of shuffling. It was not about the columns as in your example - I applied `sample` to `nrow` to shuffle the rows. – Agile Bean Sep 28 '19 at 15:20
  • How many samples do you have in total (training + test)? – desertnaut Sep 28 '19 at 15:26
  • In this concrete example, it was 1140. But I had the same observation with a very different dataset of similar size (n ~ 1000). – Agile Bean Sep 28 '19 at 15:28
  • If I understand correctly, the example uses regression and the output `y` can be 1, 2, 3, and 4? Could you post the complete code for the examples, starting from loading the libraries? It would be great if you could do it on a built-in data set. – missuse Sep 29 '19 at 05:42
  • Yes @missuse, that's correct. Sure, I can post the complete code, but how do I do it on an inbuilt dataset? Sorry, I don't know the term. I guess you mean don't post the data by `dput`? – Agile Bean Sep 29 '19 at 06:17
  • A built-in data set is any data set present in R packages. I suggest using one from [mlbench](https://cran.r-project.org/web/packages/mlbench/index.html). Alternatively you can provide your own data set, however 1000 rows is a bit too much to `dput`, and external links can change, so the question won't be reproducible after some time. Perhaps offer the data set as a [gist](https://gist.github.com/). – missuse Sep 29 '19 at 06:28
  • Thanks a lot. I appended the complete code in the description. It is ready to use with any data by setting the variable `dataset <- data`, and target_label and feature_labels as character vectors. And I will try to adapt it to an mlbench dataset later. – Agile Bean Sep 29 '19 at 06:35
  • I added a reproducible example with the built-in dataset mlbench::BostonHousing. @missuse thanks for the inspiration, I was also curious whether the effect I found only depended on my data - but apparently it didn't! – Agile Bean Sep 29 '19 at 07:17
  • @missuse did you have a look at my update with the mlbench data, as you suggested? – Agile Bean Sep 30 '19 at 10:22
  • I have. In order to get a clearer picture of what is going on, I have to do a little bit more testing. Currently I do not have time for this. When and if I get any ideas, I will post. As a first step, I think the shuffling should be performed with different seeds and compared with the non-shuffled variant. – missuse Sep 30 '19 at 10:49
  • @missuse OK, thanks for trying, and I hope you find an underlying reason. At this moment, it is safe to advise people doing machine learning with the `caret` package *not* to do shuffling if they want to get an accurate estimate of the testing set performance! – Agile Bean Oct 01 '19 at 11:09
  • @missuse I put a bounty of 50 points on finding the answer, if that's an incentive. Just found out it ends in 6 days, unfortunately. – Agile Bean Oct 02 '19 at 10:51
  • It looks like your example never touches testing.set - are you looking at that to evaluate? Can you add the code you're using for that? – jwde Oct 02 '19 at 20:14
  • @jwde Yeah, sure, but the code is trivial so I didn't include it - just the `predict` function on the testing set. But try it yourself - the effect is robust for different samples and testing sets. – Agile Bean Oct 03 '19 at 02:54
  • It may be trivial, but it would be helpful to include, both to make it easier for others to investigate, and on the off-chance that step is introducing a bias somehow. – Jon Spring Oct 03 '19 at 04:58
  • OK, I added the code to calculate the testing set performance. As you can see, nothing fancy. The results will vary a bit across runs, but the effect remains robust - i.e. the RMSE of the shuffled data turns out lower than that of the non-shuffled data. – Agile Bean Oct 03 '19 at 05:15

2 Answers


The reason you get a different test RMSE is that you have a different test set. You are shuffling your data and then using the same training.index each time, so there's no reason to believe the test set would be the same each time.

In your original comparison you need to compare the RMSE from the shuffled test data with the RMSE of the shuffled training data, not the original training data.

Edit: the shuffling is also unnecessary, as createDataPartition has its own sampling scheme. You can just change the seed if you want a different test/training split.
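
To see this concretely, here is a quick sketch (not part of the question's pipeline; the `row_id` column is added only to track which original observations end up in each test set):

# sketch: shuffling before createDataPartition yields a different test set
library(caret)
library(mlbench)
data(BostonHousing)

BostonHousing$row_id <- seq_len(nrow(BostonHousing))  # track original rows

set.seed(17)
shuffled_data <- BostonHousing[sample(nrow(BostonHousing)), ]

set.seed(171)
idx_plain <- createDataPartition(BostonHousing$medv, p = 0.8, list = FALSE)
set.seed(171)
idx_shuf  <- createDataPartition(shuffled_data$medv, p = 0.8, list = FALSE)

test_plain <- BostonHousing$row_id[-idx_plain]
test_shuf  <- shuffled_data$row_id[-idx_shuf]

length(test_plain)                        # roughly 100 hold-out rows
length(intersect(test_plain, test_shuf))  # typically only a small fraction coincide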

Jonny Phelps

I completely agree with Jonny Phelps' answer. Based on your code and the caret source code, there is no reason to suspect any kind of data leakage when using createDataPartition on shuffled data. So the variation in performance must be due to different train/test splits.

In order to prove this I checked the performance of the shuffled and non-shuffled workflow using 10 different seeds with 4 algorithms:

[Boxplots of RMSE.testing per model (glmnet, knn, rf, xgbTree), shuffled vs. non-shuffled, over 10 seeds]

I omitted lm and replaced gbm with xgboost, purely out of personal preference.

In my opinion, the result suggests there is no systematic performance pattern between shuffled and non-shuffled data. Perhaps only KNN looks suspicious, but that is just one algorithm.

Code:

create initial seeds:

set.seed(1)
gr <- expand.grid(sample = sample(1L:1e5L, 10),
                  shuffled = c(FALSE, TRUE))

loop over initial seeds:

apply(gr, 1, function(x){
  print(x)
  # apply() coerces gr to an integer matrix, so x[2] is 0/1 here
  shuffled <- x[2]
  set.seed(x[1])

  if (shuffled) {
    dataset <- BostonHousing %>% nrow %>% sample %>% BostonHousing[., ]
  } else {
    dataset <- BostonHousing %>% as_tibble()
  }

  target_label <- "medv"
  features_labels <- dataset  %>% select_if(is.numeric) %>%
    select(-target_label) %>% names %T>% print


  algorithm_list <- c(
    "glmnet",
    "knn",
    "rf",
    "xgbTree"
  )

  training_configuration <- trainControl(
    method = "repeatedcv",
    number = 5, #5 folds and 3 reps is plenty 
    repeats = 3,
    savePredictions = "final",
    search = "random") #tune hyper parameters 

  preprocess_configuration = c("center", "scale")

  dataset %<>% select(target_label, features_labels) %>% na.omit

  set.seed(x[1])
  training.index <- createDataPartition(dataset[[target_label]], p = 0.8, list = FALSE)

  training.set <- dataset[training.index, ]
  testing.set <- dataset[-training.index, ]

  target <- training.set[[target_label]]
  features <- training.set %>% select(features_labels) %>% as.data.frame

  models.list <- list()

  models.list <- algorithm_list %>%

    map(function(algorithm_label) {
      model <- train(
        x = features,
        y = target,
        method = algorithm_label,
        preProcess = preprocess_configuration,
        trControl = training_configuration,
        tuneLength = 100 #get decent hyper parameters
      )
      return(model)
    }
    ) %>%
    setNames(algorithm_list)

  observed <- testing.set[[target_label]]
  models.list %>%
    predict(testing.set) %>%
    map_df(function(predicted) {
      sqrt(mean((observed - predicted)^2))
    }) %>%
    t %>% as_tibble(rownames = "model") %>%
    rename(RMSE.testing = V1) %>%
    arrange(RMSE.testing) %>%
    as.data.frame
}) -> perf

do.call(rbind, perf) %>%
  mutate(shuffled = rep(c(FALSE, TRUE), each = 40)) %>%
  ggplot()+
  geom_boxplot(aes(x = model, y = RMSE.testing, color = shuffled)) +
  theme_bw()  
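
If you prefer numbers over the boxplot, a summary along these lines (reusing `perf` from the loop above) gives the mean and spread of the testing RMSE per model and shuffling setting:

# numeric companion to the boxplot: mean and sd of testing RMSE per model,
# split by the shuffled flag (reuses `perf` from the loop above)
do.call(rbind, perf) %>%
  mutate(shuffled = rep(c(FALSE, TRUE), each = 40)) %>%
  group_by(model, shuffled) %>%
  summarise(mean_RMSE = mean(RMSE.testing),
            sd_RMSE   = sd(RMSE.testing))
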
missuse