Apply glm() by filtering a column by its value in R

Question

I have a data frame with a dependent variable, several independent variables (indicators), and a filtering variable. My goal is to run a separate regression for each category of the filtering variable. For example, to run a regression for code == "all", I take my data frame, filter on code, and fit the model:

sample_tib %>%
    filter(code == "all") %>%
    glm(love ~ ., data = ., family = "gaussian")

But I am facing two problems:

In the example above, glm() uses all columns, including code. The desired regression input is love ~ ind1 + ind2 + ... + ind_n;
Filtering on every value of code and fitting each model by hand is costly and not really what I want.

Maybe there is a function that filters the data frame, runs a regression, and nests the results in a new data frame or list? I tried to figure this out and came across this question and Dave Gruenewald's beautiful solution. But his approach handles only one pattern, x ~ y, with one dependent and one independent variable, which is not what I need.

So, are there any elegant solutions or specific packages and functions for this problem?

Data:

sample_tib <- data.frame(
  # 12 rows per category, in the order: all, Data Science, Data Engineer
  code = rep(c("all", "Data Science", "Data Engineer"), each = 12),
  love = runif(36),
  ind1 = runif(36),
  ind2 = runif(36),
  ind3 = runif(36),
  ind4 = runif(36),
  ind5 = runif(36),
  ind6 = runif(36),
  ind7 = runif(36)
)

akrun · Answer 1 · 2021-06-17T19:42:48.910

We can use nest_by() from dplyr:

nest_by() does the grouping and nests the remaining columns, so code is not part of data and love ~ . uses only the indicators
The model is then created as a list column within mutate()

NOTE: No packages other than dplyr are used.

library(dplyr)
sample_tib %>%
    nest_by(code) %>%
    mutate(model = list(glm(love ~ ., data = data, family = 'gaussian'))) %>%
    ungroup()

-output

# A tibble: 3 x 3
  code                        data model 
  <chr>         <list<tibble[,8]>> <list>
1 all                     [12 × 8] <glm> 
2 Data Engineer           [12 × 8] <glm> 
3 Data Science            [12 × 8] <glm>
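
To get at the fitted models afterwards, one option is to pull the coefficients out of the model column. This is only a minimal sketch: toy, fits and coefs are hypothetical names, and the data is a smaller stand-in for sample_tib with two indicators and a fixed seed.

```r
library(dplyr)

set.seed(1)  # assumption: fixed seed so the toy data is reproducible

# Hypothetical stand-in for sample_tib with only two indicators
toy <- data.frame(
  code = rep(c("all", "Data Science", "Data Engineer"), each = 12),
  love = runif(36),
  ind1 = runif(36),
  ind2 = runif(36)
)

# Same pattern as above: one glm per code, stored in a list column
fits <- toy %>%
  nest_by(code) %>%
  mutate(model = list(glm(love ~ ., data = data, family = "gaussian"))) %>%
  ungroup()

# One coefficient vector per code, named by the category
coefs <- setNames(lapply(fits$model, coef), fits$code)
coefs[["all"]]
```

Looking the models up by name rather than by position avoids depending on the order in which nest_by() sorts the groups.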

score 1 · Accepted Answer · answered Jun 17 '21 at 10:33

We can split the data and apply glm to each code separately.

library(dplyr)
library(purrr)

sample_tib %>%
  group_split(code) %>%
  map(function(x) glm(love~., data = select(x, -code), family = "gaussian"))

select(x, -code) drops the code column from the data, so you can use love ~ . as the formula.
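
If you would rather spell out love ~ ind1 + ind2 + ... + ind_n than drop columns, the formula can be built once with reformulate() from the stats package and reused for every group. A minimal sketch, assuming every column other than code and love is an indicator; toy, predictors and model_formula are hypothetical names, and the data is a smaller stand-in for sample_tib.

```r
library(dplyr)
library(purrr)

set.seed(1)  # assumption: fixed seed so the toy data is reproducible

# Hypothetical stand-in for sample_tib with only two indicators
toy <- data.frame(
  code = rep(c("all", "Data Science", "Data Engineer"), each = 12),
  love = runif(36),
  ind1 = runif(36),
  ind2 = runif(36)
)

# Every column except the grouping variable and the response is a predictor
predictors <- setdiff(names(toy), c("code", "love"))
model_formula <- reformulate(predictors, response = "love")  # love ~ ind1 + ind2

# code stays in each piece of the data but is never used by the formula
models <- toy %>%
  group_split(code) %>%
  map(function(x) glm(model_formula, data = x, family = "gaussian"))
```

With an explicit formula, adding another non-indicator column to the data later only requires extending the setdiff() call.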

answered Jun 17 '21 at 10:33

Ronak Shah


Thank you. And how can I retain the category names as list names? – rg4s Jun 17 '21 at 10:39

If you want the list names use `split(.$code)` instead of `group_split`. – Ronak Shah Jun 17 '21 at 10:42
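
Put together, the split() variant from the comment above might look like this. It is a sketch on a hypothetical toy data frame (a smaller stand-in for sample_tib), not tested against the real data.

```r
library(dplyr)
library(purrr)

set.seed(1)  # assumption: fixed seed so the toy data is reproducible

# Hypothetical stand-in for sample_tib with only two indicators
toy <- data.frame(
  code = rep(c("all", "Data Science", "Data Engineer"), each = 12),
  love = runif(36),
  ind1 = runif(36),
  ind2 = runif(36)
)

# split() names each list element after its code value,
# and map() carries those names over to the list of models
models <- toy %>%
  split(.$code) %>%
  map(function(x) glm(love ~ ., data = select(x, -code), family = "gaussian"))

names(models)
models[["Data Science"]]
```

Each model can then be accessed by its category, for example with summary(models[["all"]]).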
