Running multiple linear regressions across several columns of a data frame in R

Question

I have a dataset structured as such: enter image description here

I would like to run linear regression models and ANOVA using V1, V2...etc. as the independent variables and the g column as the dependent variable in each case (i.e. lm(V1 ~ g), lm(V2 ~ g), and so forth). This would be straightforward except that these linear regressions need to be grouped by level in the pair column, such that, for example, my output contains lm(V1 ~ g) for all rows with pair 1.1 and lm(V1 ~ g) for all pairs 1.201, etc.

I've tried a number of approaches using for loops, lapply and the data.table package, and nothing gives me exactly the output I'd like. Can anyone give me any insight on the best way to tackle this problem?

Edit: My full data set has 7056 different pairs in the pair column and 100 V columns (V1...V100). My latest attempt at this problem:

df$pair <- as.factor(df$pair)
out <- list()
for (i in 3:ncol(df)){
    out[[i]] <- lapply(levels(df$pair), function(x) {
    data.frame(df=x, g = coef(summary(lm(df[,i]~ df$g, data=df[df$pair==x,])),row.names=NULL))})
    }

Dave Gruenewald · Accepted Answer · 2019-02-27T16:51:59.933

Let's get some tidyverse power here, along with broom, and forgo all these loops...

First I'll make a dummy table:

df <- data.frame(
  g = runif(50), 
  pair = sample(x = c("A", "B", "C"), size = 50, replace = TRUE), 
  V1 = runif(50), 
  V2 = runif(50), 
  V3 = runif(50), 
  V4 = runif(50), 
  V5 = runif(50),
  stringsAsFactors = FALSE
)

That's approximately what your data structure looks like. Now onto the meat of the code:

library(tidyverse)
library(broom)

df %>% 
  as_tibble %>% 
  gather(key = "column", value = "value", V1:V5) %>%       # first set the data in long format
  nest(g, value) %>%                                       # now nest the dependent and independent factors
  mutate(model = map(data, ~lm(g ~ value, data = .))) %>%  # fit the model using purrr
  mutate(tidy_model = map(model, tidy)) %>%                # clean the model output with broom
  select(-data, -model) %>%                                # remove the "untidy" parts
  unnest()                                                 # get it back in a recognizable data frame

Which gives us the following:

# A tibble: 30 x 7
   pair  column term        estimate std.error statistic  p.value
   <chr> <chr>  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
 1 C     V1     (Intercept)  0.470       0.142    3.31   0.00561 
 2 C     V1     value        0.125       0.265    0.472  0.645   
 3 B     V1     (Intercept)  0.489       0.142    3.45   0.00359 
 4 B     V1     value       -0.0438      0.289   -0.151  0.882   
 5 A     V1     (Intercept)  0.515       0.111    4.63   0.000279
 6 A     V1     value       -0.00569     0.249   -0.0229 0.982   
 7 C     V2     (Intercept)  0.367       0.147    2.50   0.0265  
 8 C     V2     value        0.377       0.300    1.26   0.231   
 9 B     V2     (Intercept)  0.462       0.179    2.59   0.0206  
10 B     V2     value        0.0175      0.322    0.0545 0.957   
# … with 20 more rows

yep, that's a beaut! Note that I used lm(g ~ value) instead of lm(value ~ g) as this is what your text description alluded to.

jfeuerman · Answer 2 · 2019-02-27T15:06:30.567

Using the tidyverse package to filter your dataframe:

library(tidyverse)

lm(V1~g, data=filter(yourData, pair==1.1))
lm(V2~g, data=filter(yourData, pair==1.201))

This ensures that you are eliminating rows that don't contain your desired pair value for each regression model. You could probably create a loop to do this, but I think it's easier to just manually keep filtering through the pair values. If you really want to use a loop, here's a fairly simple way to do it:

for (i in levels(yourData$pair)) {
  if (i==1.1) {
    mod1 <- lm(V1~g, data=filter(yourData, pair==i))
  }

  if (i==1.201) {
    mod2 <- lm(V2~g, data=filter(yourData, pair==i))
  }
}

But this is still manually looping through the levels of pair. I would have to see your whole data set to automate the looping process.

Also, if the g column contains your dependent values, the call should be lm(g~V1), lm(g~V2), etc. It shouldn't be lm(V1~g).

Thanks for the suggestion! In my full data set, I have 7056 different pairs, and 100 different V columns (V1...V100). How would you suggest automating the looping process for both, given that manually entering each is less than ideal? — user1280, Feb 27 '19 at 15:36

Running multiple linear regressions across several columns of a data frame in R

2 Answers2

Linked