
My aim is to plot the bias-variance decomposition of a cubic smoothing spline for varying degrees of freedom.
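By the bias-variance decomposition I mean the usual decomposition of the expected test error at a point,

$$E\big[(y_0 - \hat f(x_0))^2\big] = \operatorname{Bias}\big[\hat f(x_0)\big]^2 + \operatorname{Var}\big[\hat f(x_0)\big] + \sigma_\varepsilon^2,$$

which the matrices bias_temp, var_temp and mse_temp in the code below are meant to estimate.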

First I simulate a test set and a training set (the responses stored as matrices with one column per simulation). Then I iterate over 100 simulations and, in each iteration, vary the degrees of freedom of the smoothing spline.

The output I get with the code below does not show any trade-off. What am I doing wrong when calculating the bias/variance?

For reference, the right panel of this figure (slide 14) shows the trade-off I would expect (source).

rm(list = ls())

library(SimDesign)

set.seed(123)

n_sim <- 100
n_df <- 40
n_sample <- 100

mse_temp <- matrix(NA, nrow = n_sim, ncol = n_df)
var_temp <- matrix(NA, nrow = n_sim, ncol = n_df)
bias_temp <- matrix(NA, nrow = n_sim, ncol = n_df)


# Train data -----
x_train <- runif(n_sample, -0.5, 0.5)
f_train <- 0.8 * x_train + sin(6 * x_train)

epsilon_train <- replicate(n_sim, rnorm(n_sample, 0, sqrt(2)))
y_train <- replicate(n_sim, f_train) + epsilon_train

# Test data -----
x_test <- runif(n_sample, -0.5, 0.5)
f_test <- 0.8 * x_test + sin(6 * x_test)

epsilon_test <- replicate(n_sim, rnorm(n_sample, 0, sqrt(2)))
y_test <- replicate(n_sim, f_test) + epsilon_test


# For each simulated training set, fit a smoothing spline at each df and
# store the test MSE, the variance of the predictions, and the squared
# deviation of their mean from f_test
for (mc_iter in seq(n_sim)){

  for (df_iter in seq(n_df)){
    cspline <- smooth.spline(x_train, y_train[, mc_iter], df = df_iter + 1)

    cspline_predict <- predict(cspline, x_test)

    mse_temp[mc_iter, df_iter] <- mean((y_test[, mc_iter] - cspline_predict$y)^2)
    var_temp[mc_iter, df_iter] <- var(cspline_predict$y)
    # bias_temp[mc_iter, df_iter] <- bias(cspline_predict$y, f_test)^2
    bias_temp[mc_iter, df_iter] <- mean((replicate(n_sample, mean(cspline_predict$y)) - f_test)^2)

  }
}

mse_spline <- apply(mse_temp, 2, FUN = mean)
var_spline <- apply(var_temp, 2, FUN = mean)
bias_spline <- apply(bias_temp, 2, FUN = mean)


par(mfrow=c(1,3))
plot(seq(n_df),mse_spline, type = 'l')
plot(seq(n_df),var_spline, type = 'l')
plot(seq(n_df),bias_spline, type = 'l')

1 Answer


Actually I think your code works; it's just the small sample size. You hit the overfitting region very quickly, so everything in the plot is squeezed close to the left border, in the region of few degrees of freedom. If you increase n_sample you should see the expected relation.
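For example, here is a sketch of that suggestion: the question's simulation condensed, with n_sample increased to 1000 (that value is my choice; anything noticeably larger than 100 should do) and the bias/variance calculations otherwise left as in the question.

set.seed(123)

n_sim <- 100
n_df <- 40
n_sample <- 1000   # increased from 100; the only substantive change

x_train <- runif(n_sample, -0.5, 0.5)
f_train <- 0.8 * x_train + sin(6 * x_train)
y_train <- replicate(n_sim, f_train + rnorm(n_sample, 0, sqrt(2)))

x_test <- runif(n_sample, -0.5, 0.5)
f_test <- 0.8 * x_test + sin(6 * x_test)
y_test <- replicate(n_sim, f_test + rnorm(n_sample, 0, sqrt(2)))

mse_temp  <- matrix(NA, nrow = n_sim, ncol = n_df)
var_temp  <- matrix(NA, nrow = n_sim, ncol = n_df)
bias_temp <- matrix(NA, nrow = n_sim, ncol = n_df)

for (mc_iter in seq(n_sim)) {
  for (df_iter in seq(n_df)) {
    cspline <- smooth.spline(x_train, y_train[, mc_iter], df = df_iter + 1)
    pred    <- predict(cspline, x_test)$y
    mse_temp[mc_iter, df_iter]  <- mean((y_test[, mc_iter] - pred)^2)
    var_temp[mc_iter, df_iter]  <- var(pred)
    # same bias formula as in the question; the replicate() around the
    # scalar mean is dropped because the scalar recycles anyway
    bias_temp[mc_iter, df_iter] <- mean((mean(pred) - f_test)^2)
  }
}

par(mfrow = c(1, 3))
plot(seq(n_df), colMeans(mse_temp),  type = "l", ylab = "MSE")
plot(seq(n_df), colMeans(var_temp),  type = "l", ylab = "variance")
plot(seq(n_df), colMeans(bias_temp), type = "l", ylab = "squared bias")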

snaut
  • The problem is that both the bias and the variance increase with the degrees of freedom. With overfitting, the bias should decrease while the variance increases; however, that is not what I see with my code. – user483161 Dec 17 '18 at 12:46