Fit a model on each group and evaluate it using data from all rows not in this group, R

Question

I want to fit a model using data from each group in a dataframe. Then I want to use this model to predict it to all data in the dataframe that was not in the group and compute a metric like the RMSE. I have some issues to wrap my head around how I could achieve something like this without doing many manual steps. I have the following toy-example using the mtcarsdata.

I want to fit the model lm(mpg ~ wt, data=mtcars) for each group of cyl and use this model and the data from the remaining points to copmute a value like the RMSE.

I wrote the following code, but it is not working and also does not feel so good. I'd be happy to hear about any tipps and tricks:)


library(tidyverse)

# 'global' model
lm(mpg ~ wt, data=mtcars)

# 1. fit one model for each  class of cyl, 
# 2. use it to predict the remaining ones,
# 3. get the RMSE for each class of cyl

res = mtcars %>% 
  group_by(cyl) %>% 
  mutate(
    models = list(lm(mpg ~ wt, data = cur_data())), # why does this needs to be a list?
    ref_data = list(mtcars %>% filter(!cyl %in% cur_data()$cyl[[1]])), # get all the data minus the current group at put it in a list column
    predict = map(models, ~predict(.x, newdata=mtcars %>% filter(!cyl %in% cur_data()$cyl[[1]]))), # predict it on all others -> will store one df per row...
    rmse = map2_dbl(predict, ref_data, ~sqrt((sum(.y - .y)^2))/length(predict)) 
  )

I'd avoid using tidyverse methods for this. A simple `for` loop would be much easier to debug. Is there some reason you don't want to do that? — user2554330, Aug 11 '21 at 11:34
In fact, since there are 3 groups, you'll get 3 different predictions for each point (and you want to ignore one of them). That doesn't fit into the tidyverse model. — user2554330, Aug 11 '21 at 11:37

Rui Barradas · Answer 1 · 2021-08-11T13:39:15.567

Here are two base R solutions, one with a for loop and the other with a sapply loop.

rmse <- numeric(length(unique(mtcars$cyl)))
ucyl <- unique(mtcars$cyl)

for(i in seq_along(ucyl)){
  cc <- ucyl[i]
  fit <- lm(mpg ~ wt, data = mtcars, subset = cyl == cc)
  new_wt <- mtcars$wt[mtcars$cyl != cc]
  new_mpg <- mtcars$mpg[mtcars$cyl != cc]
  ypred <- predict(fit, newdata = data.frame(wt = new_wt))
  rmse[i] <- sqrt(mean((new_mpg - ypred)^2))
}
rmse
#[1] 4.378324 3.354043 6.942228

rmse2 <- sapply(seq_along(ucyl), function(i){
  cc <- ucyl[i]
  fit <- lm(mpg ~ wt, data = mtcars, subset = cyl == cc)
  #
  new_wt <- mtcars$wt[mtcars$cyl != cc]
  new_mpg <- mtcars$mpg[mtcars$cyl != cc]
  #
  ypred <- predict(fit, newdata = data.frame(wt = new_wt))
  sqrt(mean((new_mpg - ypred)^2))
})

identical(rmse, rmse2)
#[1] TRUE

Julia Silge · Accepted Answer · 2021-08-12T16:25:01.213

You could consider using tidymodels functions for this, because this situation is exactly what an rsample::rsplit object is intended for.

For example here, the "analysis" set has cyl == 6 and the "assessment" set has cyl equal to other values:

library(rsample)

ind <- list(analysis = which(mtcars$cyl == 6), 
            assessment = which(mtcars$cyl != 6))
make_splits(ind, mtcars)
#> <Analysis/Assess/Total>
#> <7/25/32>

^{Created on 2021-08-11 by the reprex package (v2.0.1)}

To go through your modeling analysis, you would create a function to make an split according to your parameter (cyl here), then use purrr::map() to map over the values of that parameter and:

make the splits
fit the models to each split
predict with each model on each split (notice you predict on the assessment set)
compute RMSE

library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip

manual_split_from_cyl <- function(cyl_value) {
    ind <- list(analysis = which(mtcars$cyl == cyl_value), 
                assessment = which(mtcars$cyl != cyl_value))
    make_splits(ind, mtcars)
}

tibble(cyl = unique(mtcars$cyl)) %>%
    mutate(splits = map(cyl, manual_split_from_cyl),
           model = map(splits, ~ lm(mpg ~ wt, data = analysis(.))),
           preds = map2(model, splits, ~ predict(.x, newdata = assessment(.y))),
           rmse = map2_dbl(splits, preds, ~ rmse_vec(assessment(.x)$mpg, .y)))
#> # A tibble: 3 × 5
#>     cyl splits          model  preds       rmse
#>   <dbl> <list>          <list> <list>     <dbl>
#> 1     6 <split [7/25]>  <lm>   <dbl [25]>  4.38
#> 2     4 <split [11/21]> <lm>   <dbl [21]>  3.35
#> 3     8 <split [14/18]> <lm>   <dbl [18]>  6.94

^{Created on 2021-08-11 by the reprex package (v2.0.1)}

Thank you so so much! That looks so easy and useful, yet I have some troubles wrapping my head around the syntax of the tibble. E.g. why is only a `.` used in the `data = analysis(.)` and not a `.x`. — Lenn, Aug 12 '21 at 12:10
You can use `.x` if you prefer with `map()`, but you have to use `.x` and `.y` with `map2()` because there are two arguments that have to be differentiated. — Julia Silge, Aug 12 '21 at 16:24
Thanks a lot:) I think I will try to implement the tidymodels approach more:)! I was just scared to go down the rabbithole of learning the syntax of such a big package — Lenn, Aug 13 '21 at 06:13

Marco Sandri · Answer 3 · 2021-08-11T14:25:09.117

1

Here is a solution based on tidyverse.

library(tidyverse)
res = mtcars %>% 
  group_by(cyl) %>% 
  summarize(
    data1 = list(cur_data_all()),
    data2 = list(subset(mtcars, cyl!=unique(data1[[1]]$cyl) )),
    models = list( lm(mpg ~ wt, data=data1[[1]]) ),
    pred = list(predict(models[[1]], newdata=data2[[1]]) ),
    rmse = sqrt(mean((data2[[1]]$mpg - pred[[1]])^2))
  )
  print(res)

# A tibble: 3 x 6
    cyl data1              data2          models pred       rmse2
  <dbl> <list>             <list>         <list> <list>     <dbl>
1     4 <tibble [11 x 11]> <df [21 x 11]> <lm>   <dbl [21]>  3.35
2     6 <tibble [7 x 11]>  <df [25 x 11]> <lm>   <dbl [25]>  4.38
3     8 <tibble [14 x 11]> <df [18 x 11]> <lm>   <dbl [18]>  6.94

edited Aug 11 '21 at 14:25

answered Aug 11 '21 at 11:47

Marco Sandri

23,289
7
54
58

thanks a lot!!:) Just a quick question: What difference does it make if I use `cur_data()` and `cur_data_all()`. I just tried it, and with `cur_data()` it does not work:/ – Lenn Aug 11 '21 at 15:07
Oh and one last question: Why all this `list`s? I do understand what it does exactly and where the different to `map` in this case is? E.g. when you write `models = list( lm(mpg ~ wt, data=data1[[1]]) )`, why does it construct 3 different models when it is given `data2[[1]]` three times the same data? – Lenn Aug 11 '21 at 15:17
1

@Lenn `cur_data_all()` gives the current data for the current group including grouping variable(s), while `cur_data()` excludes grouping variable(s). – Marco Sandri Aug 11 '21 at 17:23
1

@Lenn Lastly, `data2[[1]]` is NOT the same dataset for each `cyl` group! Take a look at these datasets using `res$data2`. The datasets are clearly different. – Marco Sandri Aug 11 '21 at 17:37
1

@Lenn About the use of the lists: first try to compare and understand the output of `mtcars %>% group_by(cyl) %>% summarize(data1=list(cur_data_all()))` and of `mtcars %>% group_by(cyl) %>% summarize(data1=cur_data_all())`. We need only 3 rows in our tibble, one for each category of `cyl`. The second solution returns 32 rows. – Marco Sandri Aug 11 '21 at 17:38
Thank you so much for the explanations. I just do not get why the wrapping inside a `list` returns one list-element for each group. Why does it not return a list for each row, which would result in a list-column of length 32 where each single element again is a list. – Lenn Aug 12 '21 at 11:36
@Lenn Remember that `res` is a `tibble` not a data frame. The behaviour is different. – Marco Sandri Aug 12 '21 at 16:07

Fit a model on each group and evaluate it using data from all rows not in this group, R

3 Answers3