R loop for linear regression lm(y~x) and save model output as a dataset

Question

I would like to make a regression loop lm(y~x) with a dataset with one y and several x, and run the regression for each x, and then also store the results (estimate, p-values) in a data.frame() so I don't have to copy them manually (especially as my real data set it much bigger). I think this should not be too difficult, but I struggle a lot to make it work and appreciate your help: Here is my sample data set:

sample_data <- data.frame(
  fit = c(0.8971963, 1.4205607, 1.4953271, 0.8971963, 1.1588785, 0.1869159, 1.1588785, 1.142857143, 0.523809524),
  Xbeta = c(2.8907744,  -0.7680777,  -0.7278847, -0.06293916, -0.04047017, 2.3755812, 1.3043990,  -0.5698354, -0.5698354),
  Xgamma = c( 0.1180758, -0.6275700, 0.3731964,  -0.2353454,-0.5761923,  -0.5186803, 0.43041835, 3.9111749, -0.5030638),
  Xalpha = c(0.2643091, 1.6663923,  0.4041057, -0.2100472, -0.2100472, 7.4874195, -0.2385278,  0.3183102, -0.2385278),
  Xdelta = c(0.1498646, -0.6325119, -0.5947564, -0.2530748, 3.8413339, 0.6839322, 0.7401834,  3.8966404,  1.2028175)
)
#yname <- ("fit")
#xnames <- c("Xbeta ","Xgamma", "Xalpha", "Xdelta")

The simple regression with the first independant variable Xbeta would look like this lm(fit~Xbeta, data= sample_data)and I would like to run the regression for each variable starting with an "X" and then store the result (estimate, p-value).

I have found a code that allows me to select variables that start with "X" and then use it for the model, but the code gives me an error from mutate() onwards (indicated by #).

library(tidyverse)
library(tsibble)

sample_data %>% 
  gather(stock, return, starts_with("X")) %>%  
  group_nest(stock) 
#  %>% 
#  mutate(model = map(data,
#                     ~lm(formula = "fit~ return",
#                         data = .x))
# ),
#           resid = map(model, residuals)
#           ) %>%
#           unnest(c(data,resid)) %>%  
#           summarise(sd_residual = sd(resid))

For then storing the regression results I have also found the following appraoch using the R package "broom": r for loop for regression lm(y~x)

sample_data%>% 
  group_by(y,x)%>%                            # get combinations of y and x to regress
  do(tidy(lm(fRS_relative~xvalue, data=.)))

But I always get an error for group_by() and do()

I really appreciate your help!

First things first, tidy up your data! Instead of many separate columns, use one X, one Y, and one column for factor, e.g. beta, gamma, alpha, delta. Then you can easily group by that column, and perform the regression for all of them! — mhovd, Jul 13 '20 at 13:19
@mhh In my real data set I have c. 200 independent `x` variables and I need a regression per X and can thus not group them together. I just added the X in front of the name to make it easier in the loop (or so I thought). This is why I try to do a loop & provided a sample data set (with values from my real data). — midas, Jul 13 '20 at 14:06
Thank you @mhh - now I've already got helpful answers that solved my issue! — midas, Jul 13 '20 at 14:40

score 0 · Accepted Answer · answered Jul 13 '20 at 13:33

One option would be to use lapply to perform a regression with each of the independent variables. Use tidy from broom library to store the results into a tidy format.

lapply(1:length(xnames), 
       function(i) broom::tidy(lm(as.formula(paste0('fit ~ ', xnames[i])), data = sample_data))) -> test

and then combine all the results into a single dataframe:

do.call('rbind', test)


# term        estimate std.error statistic   p.value
# <chr>          <dbl>     <dbl>     <dbl>     <dbl>
#   1 (Intercept)   1.05      0.133      7.89  0.0000995
# 2 Xbeta        -0.156     0.0958    -1.62  0.148    
# 3 (Intercept)   0.968     0.147      6.57  0.000313 
# 4 Xgamma        0.0712    0.107      0.662 0.529    
# 5 (Intercept)   1.09      0.131      8.34  0.0000697
# 6 Xalpha       -0.0999    0.0508    -1.96  0.0902   
# 7 (Intercept)   0.998     0.175      5.72  0.000723 
# 8 Xdelta       -0.0114    0.0909    -0.125 0.904

Thank you! This really looks much easier and definitely works with my sample.data - with my real data I get the error `error in: lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 0 (non-NA) cases` but I guess this I can resolve by cleaning up the data — midas, Jul 13 '20 at 14:01
As I have suspected, after cleaning up the data (i.e. deleted columns that included fewer than 2 values), it works fine now. Thank you a lot! — midas, Jul 13 '20 at 14:34

score 0 · Answer 2 · answered Jul 13 '20 at 13:45

Step one

Your data is messy, let us tidy it up.

sample_data <- data.frame(
  fit = c(0.8971963, 1.4205607, 1.4953271, 0.8971963, 1.1588785, 0.1869159, 1.1588785, 1.142857143, 0.523809524),
  Xbeta = c(2.8907744,  -0.7680777,  -0.7278847, -0.06293916, -0.04047017, 2.3755812, 1.3043990,  -0.5698354, -0.5698354),
  Xgamma = c( 0.1180758, -0.6275700, 0.3731964,  -0.2353454,-0.5761923,  -0.5186803, 0.43041835, 3.9111749, -0.5030638),
  Xalpha = c(0.2643091, 1.6663923,  0.4041057, -0.2100472, -0.2100472, 7.4874195, -0.2385278,  0.3183102, -0.2385278),
  Xdelta = c(0.1498646, -0.6325119, -0.5947564, -0.2530748, 3.8413339, 0.6839322, 0.7401834,  3.8966404,  1.2028175)
)

tidyframe = data.frame(fit = sample_data$fit,
           X = c(sample_data$Xbeta,sample_data$Xgamma,sample_data$Xalpha,sample_data$Xdelta),
           type = c(rep("beta",9),rep("gamma",9),rep("alpha",9),rep("delta",9)))

^{Created on 2020-07-13 by the reprex package (v0.3.0)}

Step two

Iterate over each type, and get the P-value, using this nifty function

# From https://stackoverflow.com/a/5587781/3212698
lmp <- function (modelobject) {
  if (class(modelobject) != "lm") stop("Not an object of class 'lm' ")
  f <- summary(modelobject)$fstatistic
  p <- pf(f[1],f[2],f[3],lower.tail=F)
  attributes(p) <- NULL
  return(p)
}

Then do some clever piping


tidyframe %>% group_by(type) %>%
  summarise(type = type, p = lmp(lm(formula = fit ~ X))) %>%
  unique()
#> `summarise()` regrouping output by 'type' (override with `.groups` argument)
#> # A tibble: 4 x 2
#> # Groups:   type [4]
#>   type       p
#>   <fct>  <dbl>
#> 1 alpha 0.0902
#> 2 beta  0.148 
#> 3 delta 0.904 
#> 4 gamma 0.529

^{Created on 2020-07-13 by the reprex package (v0.3.0)}

R loop for linear regression lm(y~x) and save model output as a dataset

2 Answers2