1

I have a dataset that I want to use to estimate some linear models. The data can be replicated with the code below.

set.seed(123)
df <- tibble(
  location= c(rep("A",5),rep("B",5), rep("C",5), 
              rep("D",5), rep("E",5), rep("F",5),
              rep("G",5), rep("H",5), rep("I",5),
              rep("J",5)),
  year = rep(c(2017,2018,2019,2020,2021),10),
  w = rnorm(50),  
  x = rnorm(50),
  y = rnorm(50),
  z = rnorm(50)
)

It would have the following look

location year w x y z
A 2017 -0.560 0.253 -0.710 0.788
A 2018 -0.230 -0.0285 0.257 0.769
A 2019 1.56 -0.0429 -0.247 0.332
A 2020 -0.230 -0.0285 0.257 0.769
A 2021 -0.230 -0.0285 0.257 0.769
B 2017 -0.560 0.253 -0.710 0.788
B 2018 -0.230 -0.0285 0.257 0.769
B 2019 1.56 -0.0429 -0.247 0.332
B 2020 -0.230 -0.0285 0.257 0.769
B 2021 -0.230 -0.0285 0.257 0.769

I want to run models with two dependent variables (w and x) using y and z as explanatory variables for all the years separately and get the distribution of the parameter estimates. Basically,in the end I would like to have a dataframe for the estimates across the years for each dependent variable.

The approach I initially considered involves running the models individually, duplicating the same steps for each, and then combining the rows from these separate dataframes. However, I want to use purrr's funcationality to write an efficient code.

Here is what I have coded so far.

library(tidyverse)
library(tidymodels)

dependent_vars<- c("w","x")
model_1 <- reformulate(termlabels = c("y","z"), response = dependent_vars[1] )

lm_mod <- linear_reg()

lm_fit_2017_w <- 
  lm_mod %>% 
  fit(model_1, data = df %>% filter (year==2017)) %>%
  tidy() %>% 
  mutate(year=2017, response= dependent_vars[1])

# Repeat this for the rest of the years as well as for x and finally bind the rows

I have come across some resources dealing with the same type of problem but with just different dependent variables or predicators stackoverlow question & blog.

Thank you in advance for the help.

1 Answers1

1

I'd first generate a data.frame with all possible combinations of the years and dependent variables, and then use this with pmap which can take two or more input variables at the same time.

I've attached an untested solution with tidymodels (can't install it at the moment) and one without that is tested:

set.seed(123)
df <- tibble(
  location= c(rep("A",5),rep("B",5), rep("C",5), 
              rep("D",5), rep("E",5), rep("F",5),
              rep("G",5), rep("H",5), rep("I",5),
              rep("J",5)),
  year = rep(c(2017,2018,2019,2020,2021),10),
  w = rnorm(50),  
  x = rnorm(50),
  y = rnorm(50),
  z = rnorm(50)
)

dependent_vars<- c("w","x")
years <- unique(df$year)

library(purrr)
library(dplyr)

possible_combinations <- expand.grid(year = years, dependent_var = dependent_vars,
                                     stringsAsFactors = FALSE)

possible_combinations %>% 
  pmap(function(year, dependent_var) {
    data_model <- df %>% 
      filter(year == year)
    model_formula <- as.formula(paste0(dependent_var, " ~ y + z"))
    res <- lm(model_formula, data = data_model)
    tibble(res = list(res), year = year, dependent_var = dependent_var)
  }) %>% 
  bind_rows()

library(tidymodels)

lm_mod <- linear_reg()

possible_combinations %>% 
  pmap(function(year, dependent_var) {
    data_model <- df %>% 
      filter(year == year)
    model_formula <- as.formula(paste0(dependent_var, " ~ y + z"))
    lm_mod %>% 
      fit(model_formula, data = data_model) %>% 
      tidy() %>% 
      mutate(year = year, dependent_var = dependent_var)
  }) %>% 
  bind_rows()
starja
  • 9,887
  • 1
  • 13
  • 28
  • Thank you very much @starja. I was exactly looking for such a solution. I couldn't quite figure what map function to use. *pmap* it is. – Abraham Assefa Aug 10 '23 at 21:59