I am going to use a penalized regression model from the glmnet
package on a panel dataset. Being panel means that I will test the model not with cross validation but with rolling origin, so I will not use the cv.glmnet
function, that works only with cross-validation. Instead, I will use the default glmnet
function. This function have two parameters that need to be optimized, lambda
and alpha
. lambda
is allowed to be an array, and if so, a model for each element in the array is fitted, but alpha
is required to be a scalar. I am interested in running models for different alpha
values. So far, I am doing this sequentially with a for loop, like in the following example (in this example I use the BostonHousing
dataset for simplicity, though is not panel data)
# Load libraries
library(glmnet)
library(mlbench)
library(tidyverse)
# Load dataset
data(BostonHousing)
# Split into train / test
x_train = BostonHousing %>% slice(1:400) %>% select(-c(medv, chas)) %>% as.matrix()
x_test = BostonHousing %>% slice(401:n()) %>% select(-c(medv, chas)) %>% as.matrix()
y_train = BostonHousing %>% slice(1:400) %>% select(medv) %>% as.matrix()
y_test = BostonHousing %>% slice(401:n()) %>% select(medv) %>% as.matrix()
# Define grid for parameters
lambda_param = 10^seq(-3, 1, by=0.5)
alpha_param = seq(0, 1, 0.1)
# Sequential approach
mse_sequential = matrix(NA,length(lambda_param), length(alpha_param))
for(i in seq_along(alpha_param))
{
fit_seq = glmnet(x=x_train, y=y_train, family='gaussian', lambda=lambda_param, alpha=alpha_param[i])
p = predict(fit_seq, newx=x_test)
mse_sequential[,i] = map_dbl(seq_along(lambda_param), function(j){mean((p[,j]-y_test)^2)})
}
This approach works, but seems like highly inefficient to me, and I considered parallelizing it. I tried to use the furrr
package (that allows for paralelization of the purrr
functions) but I am facing two problems:
Problem 1
The folllowing piece of code (suposedly) runs furrr
sequentially:
library(furrr)
plan(sequential)
glmnet_alpha = partial(glmnet, x = x_train, y = y_train, family = "gaussian", weights = rep(1, nrow(x_train)), lambda=lambda_param)
fit = future_map(alpha_param, glmnet_alpha)
But the solution achieved using a for loop differs from the solution achieved using furrr
. For example, having a look to the coefficients for lambda=10
and alpha=0
:
fit_seq = glmnet(x=x_train, y=y_train, family='gaussian', lambda=10, alpha=0)
as.vector(fit_seq$beta)
as.vector(fit[[1]]$beta[,1])
Problem 2
If I try to run furrr
in parallel I get an error message:
plan(multiprocess)
glmnet_alpha = partial(glmnet, x = x_train, y = y_train, family = "gaussian", weights = rep(1, nrow(x_train)), lambda=lambda_param)
fit = future_map(alpha_param, glmnet_alpha)
Error in drop(y) : objeto 'y_train' no encontrado
So any help understanding what is going on here, or providing an alternative approach for running this in parallel would be appreciated.