
I am going to use a penalized regression model from the glmnet package on a panel dataset. Because the data are a panel, I will evaluate the model with rolling-origin validation rather than cross-validation, so I cannot use the cv.glmnet function, which only supports cross-validation. Instead, I will use the plain glmnet function. This function has two parameters that need to be tuned, lambda and alpha. lambda may be a vector, in which case a model is fitted for each of its elements, but alpha must be a scalar. Since I am interested in fitting models for several alpha values, so far I have been doing this sequentially with a for loop, as in the following example (I use the BostonHousing dataset for simplicity, even though it is not panel data):

# Load libraries
library(glmnet)
library(mlbench)
library(tidyverse)

# Load dataset
data(BostonHousing)

# Split into train / test
x_train = BostonHousing %>% slice(1:400) %>% select(-c(medv, chas)) %>% as.matrix()
x_test  = BostonHousing %>% slice(401:n()) %>% select(-c(medv, chas)) %>% as.matrix()

y_train = BostonHousing %>% slice(1:400) %>% select(medv) %>% as.matrix()
y_test  = BostonHousing %>% slice(401:n()) %>% select(medv) %>% as.matrix()

# Define grid for parameters
lambda_param = 10^seq(-3, 1, by=0.5)
alpha_param = seq(0, 1, 0.1)

# Sequential approach
mse_sequential = matrix(NA, length(lambda_param), length(alpha_param))
for (i in seq_along(alpha_param)) {
  fit_seq = glmnet(x = x_train, y = y_train, family = "gaussian",
                   lambda = lambda_param, alpha = alpha_param[i])
  p = predict(fit_seq, newx = x_test)
  mse_sequential[, i] = map_dbl(seq_along(lambda_param),
                                function(j) mean((p[, j] - y_test)^2))
}
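To quantify how slow this actually is, the whole sweep can be wrapped in system.time (a rough sketch; timings will of course vary by machine):

# Time the full sequential sweep over alpha (each fit computes the whole lambda path)
system.time({
  for (i in seq_along(alpha_param)) {
    glmnet(x = x_train, y = y_train, family = "gaussian",
           lambda = lambda_param, alpha = alpha_param[i])
  }
})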

This approach works, but it seems highly inefficient to me, so I considered parallelizing it. I tried the furrr package (which provides parallel versions of the purrr functions), but I am facing two problems:

Problem 1

The following piece of code (supposedly) runs furrr sequentially:

library(furrr)
plan(sequential)
glmnet_alpha = partial(glmnet, x = x_train, y = y_train, family = "gaussian",
                       weights = rep(1, nrow(x_train)), lambda = lambda_param)
fit = future_map(alpha_param, glmnet_alpha)

But the solution obtained with the for loop differs from the one obtained with furrr. For example, looking at the coefficients for lambda = 10 and alpha = 0:

fit_seq = glmnet(x = x_train, y = y_train, family = "gaussian", lambda = 10, alpha = 0)
as.vector(fit_seq$beta)        # coefficients from the direct single-lambda fit
as.vector(fit[[1]]$beta[, 1])  # coefficients for the same alpha/lambda from the future_map fit
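A quick numeric check of the mismatch (a sketch, assuming both fits are still in the session; as far as I can tell glmnet sorts a supplied lambda vector into decreasing order, so column 1 should correspond to lambda = 10):

# all.equal() returns TRUE on a match, otherwise a description of the difference
all.equal(as.vector(fit_seq$beta), as.vector(fit[[1]]$beta[, 1]))
# largest absolute coefficient difference between the two fits
max(abs(as.vector(fit_seq$beta) - as.vector(fit[[1]]$beta[, 1])))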

Problem 2

If I try to run furrr in parallel I get an error message:

plan(multiprocess)
glmnet_alpha = partial(glmnet, x = x_train, y = y_train, family = "gaussian",
                       weights = rep(1, nrow(x_train)), lambda = lambda_param)
fit = future_map(alpha_param, glmnet_alpha)

Error in drop(y) : object 'y_train' not found
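One workaround that might avoid the error (a sketch I have not fully verified) is to drop partial and reference the data inside an anonymous function, so that furrr/future can detect x_train, y_train and lambda_param as globals and export them to the workers:

plan(multisession)  # newer versions of future deprecate multiprocess in favor of multisession
fit_par = future_map(alpha_param, function(a) {
  glmnet(x = x_train, y = y_train, family = "gaussian",
         lambda = lambda_param, alpha = a)
})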

So any help understanding what is going on here, or an alternative approach for running this in parallel, would be appreciated.

  • Your for loop runs very quickly on my system. When you parallelize, you also need to account for the time to send the environment or objects to the other CPUs, and to combine the results again, so I really don't think you gain much – StupidWolf Apr 20 '20 at 12:22
  • Also, if you check your fit from future_map, the coefficients are all the same, meaning your partial did not work... I think you need to sort out how to use partial correctly – StupidWolf Apr 20 '20 at 12:30
  • I think you would be best off using a meta-package for this sort of thing: caret, mlr or mlr3 are my recommendations. I recommend picking one of them, reading up a bit, and if you get stuck, posting another question. caret is probably the most popular and well known. I am a fan of mlr3. – missuse Apr 20 '20 at 15:28
