
The goal is to do multiple logit regressions on different training data frames and then to evaluate the performance on respective test data sets.

First, we create multiple (5) randomly sampled data frames from df:

for(i in 1:5) {
  nr <- paste("random_df", i, sep = "_")
  assign(nr, df[sample(nrow(df)), ])  # shuffle the rows of df
}

Then, we create indicators for the split into a training and a test set:

train <- 1:(length(df$y)*0.8)
test  <- !(1:nrow(df) %in% train)
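A quick sanity check that the two indicators partition the rows (a minimal sketch, assuming a hypothetical 100-row `df` with the question's column names):

```r
# Hypothetical 100-row df mirroring the question's setup
df <- data.frame(y  = rbinom(100, 1, 0.5),
                 x1 = rnorm(100),
                 x2 = rnorm(100))

train <- 1:(nrow(df) * 0.8)        # first 80 row indices
test  <- !(1:nrow(df) %in% train)  # TRUE for the remaining 20 rows

length(train)  # 80
sum(test)      # 20
```

Since the `random_df_*` copies are already shuffled, taking the first 80% of rows as the training set introduces no ordering bias.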

Now we'd like to loop a logit regression over each training data frame. Herein lies the first problem: with the apply family we are only able to collect the output in a matrix or a list. As an alternative, we put the random samples into a list (a data frame or matrix would also be possible):

lr_list <- list(random_df_1,random_df_2,random_df_3,random_df_4,random_df_5)

Then, we loop the logit regressions over all data frames within the list:

for(i in 1:5) {
  index <- paste("lr_train", i, sep = "_")
  assign(index, lapply(lr_list, function(x) {glm(y ~ x1 + x2, data=x, 
  subset=train, family=binomial)}))
}

Here lapply creates a list and sapply a matrix of each estimation result.
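The difference can be seen on a toy example (hypothetical data): lapply keeps each fitted model intact as a list element, while sapply simplifies the fits into a matrix of glm components, losing the object structure:

```r
# Toy illustration: lapply keeps full glm objects, sapply simplifies
dfs <- list(data.frame(y = c(0, 1, 0, 1), x = 1:4),
            data.frame(y = c(1, 0, 1, 1), x = 1:4))

fits_l <- lapply(dfs, function(d) glm(y ~ x, data = d, family = binomial))
fits_s <- sapply(dfs, function(d) glm(y ~ x, data = d, family = binomial))

inherits(fits_l[[1]], "glm")  # TRUE -- usable with predict()
is.matrix(fits_s)             # TRUE -- the glm structure is lost
```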

The goal is to get an output in the form of a glm object in order to conduct predictions using each training glm, and thus be able to evaluate the model performance across different test/train data constellations:

lr_test_1 <- predict(lr_train_1, random_df_1[test, ], type="response")

Any help is very appreciated.

Dima
  • You should check out the `modelr` package. It makes a lot of this stuff easier: https://github.com/tidyverse/modelr – Andrew Brēza Aug 02 '17 at 13:27
  • Thanks @AndrewBrēza Could you give me a hint on how to not only `resample` but also evaluate the model on **multiple train and test sets** using the `modelr` package? – Dima Aug 04 '17 at 21:06

2 Answers


I'm pretty sure you misunderstand the use and output of lapply. It loops over the input and creates a list of objects whose class is whatever the applied function returns.

If I read your code correctly, this part

for(i in 1:5) {
  index <- paste("lr_train", i, sep = "_")
  assign(index, lapply(lr_list, function(x) {glm(y ~ x1 + x2, data=x, 
  subset=train, family=binomial)}))
}

is essentially looping twice over the same thing, therefore creating five identical lists.

Instead, just use:

lr_train <- lapply(lr_list, function(x)
  glm(y ~ x1 + x2, data = x, subset = train, family = binomial))

and then use mapply to run your predict over the list of glm objects:

lr_test <- mapply(function(x, y) predict(x, y[test, ], type = "response"), lr_train, lr_list)

Please note that this is untested due to the lack of a working data example.
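For completeness, here is a self-contained sketch with simulated data (hypothetical columns y, x1, x2 matching the question) showing that the lapply/mapply combination works end to end:

```r
set.seed(1)
# Hypothetical data standing in for the question's df
df <- data.frame(y  = rbinom(100, 1, 0.5),
                 x1 = rnorm(100),
                 x2 = rnorm(100))

# Five shuffled copies of df, as in the question
lr_list <- lapply(1:5, function(i) df[sample(nrow(df)), ])

train <- 1:(nrow(df) * 0.8)
test  <- !(1:nrow(df) %in% train)

# One glm per shuffled data frame
lr_train <- lapply(lr_list, function(x)
  glm(y ~ x1 + x2, data = x, subset = train, family = binomial))

# Pairwise predict: model i on the test rows of data frame i
lr_test <- mapply(function(x, y) predict(x, y[test, ], type = "response"),
                  lr_train, lr_list)

dim(lr_test)  # 20 x 5: one column of test predictions per sample
```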

LAP
    Thanks @Leo P. This worked fine, although untested. And you were right, I misunderstood `lapply`. We went on creating a factor using: `lr_estim <- cut(lr_test, breaks=c(-Inf, .5, Inf), labels=c(0,1))`. Then we created a list via `lr_tab <- lapply(lr_list, function(x) {table(x$y[test], lr_estim, dnn = c("real", "prediction"))})`. The next problem consists in creating a `prop.table` from _lr_tab_, as it's a list, not a table. Thanks for any help. – Dima Aug 04 '17 at 21:01
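Regarding the follow-up in the comment: since `lr_tab` is a list of tables, `prop.table` can simply be applied element-wise with another lapply (a sketch, using hypothetical confusion tables in place of the commenter's `lr_tab`):

```r
# Hypothetical confusion tables standing in for lr_tab from the comment
lr_tab <- list(table(real = c(0, 0, 1, 1), prediction = c(0, 1, 1, 1)),
               table(real = c(0, 1, 0, 1), prediction = c(0, 1, 0, 1)))

# prop.table works per element when mapped over the list
lr_prop <- lapply(lr_tab, prop.table)

lr_prop[[1]]  # cell proportions; each table sums to 1
```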

Well, not long ago I learned a trick with the dplyr and purrr packages. It's about working with nested data frames:

library(dplyr)
library(purrr)
library(tibble)

nested_df <- tibble(subdf = 1:5) %>% # choose the number of 'random_df's
  rowwise() %>% 
  mutate(data = list(df[sample(nrow(df)), ])) %>% # a list-column of shuffled copies of df
  ungroup() %>% 
  mutate(model = map(data, ~ glm(y ~ x1 + x2, data = .x, 
                                 subset = train, family = binomial))) # iterate with map through all data frames in 'data'


nested_df

  subdf                   data     model
  <int>                 <list>    <list>
1     1 <data.frame [100 x 3]> <S3: glm>
2     2 <data.frame [100 x 3]> <S3: glm>
3     3 <data.frame [100 x 3]> <S3: glm>
4     4 <data.frame [100 x 3]> <S3: glm>
5     5 <data.frame [100 x 3]> <S3: glm>

And we can look at every model:

nested_df$model[[1]]

Call:  glm(formula = y ~ x1 + x2, family = binomial, data = .x, subset = train)

Coefficients:
(Intercept)          x1b          x1c           x2  
  3.467e+00   -5.085e-03    1.300e-02    9.368e-05  

Degrees of Freedom: 79 Total (i.e. Null);  76 Residual
Null Deviance:      0.3428 
Residual Deviance: 0.3408   AIC: 12.7

The output above is from my quickly simulated df:

df <- data.frame(y = rnorm(100, 100),
                 x1 = sample(letters[1:3], size = 100, replace = T),
                 x2 = runif(100 ,0, 1000)) %>% 
  mutate(y = y/max(y))

You can make predictions for each glm with a similar combination of mutate() and map().
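A sketch of that prediction step, self-contained with a hypothetical df and the train/test indicators from the question (map2 is the two-argument variant of map, pairing each model with its own data frame):

```r
library(dplyr)
library(purrr)
library(tibble)

# Hypothetical df, train and test, as elsewhere in this thread
set.seed(1)
df <- data.frame(y  = rbinom(100, 1, 0.5),
                 x1 = rnorm(100),
                 x2 = rnorm(100))
train <- 1:80
test  <- !(1:nrow(df) %in% train)

nested_df <- tibble(subdf = 1:5) %>%
  rowwise() %>%
  mutate(data = list(df[sample(nrow(df)), ])) %>%
  ungroup() %>%
  mutate(model = map(data, ~ glm(y ~ x1 + x2, data = .x,
                                 subset = train, family = binomial))) %>%
  # map2 pairs each fitted model with its own data frame and
  # predicts on that data frame's test rows
  mutate(pred = map2(model, data, ~ predict(.x, .y[test, ], type = "response")))

length(nested_df$pred[[1]])  # 20 test-set probabilities per model
```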

Andrey Kolyadin