How to first group_by() then iterate lm() through columns?

Question

Let's say we have a data frame with a set of 3 dependent variables and 6 independent variables tagged by a grouping variable. An example of this format is generated with the sample code below:

library(tidyverse)
library(broom)
n  <- 15
df  <- data.frame(groupingvar= sample(letters[1:2], size = n, replace = TRUE),
                  y1 = rnorm(n,10,1), y2=rnorm(n,100,10), y3=rnorm(n,1000,100),
                  x1=  rnorm(n,10,1), x2=rnorm(n,10,1), x3=rnorm(n,10,1),
                  x4=rnorm(n,10,1), x5=rnorm(n,10,1), x6=rnorm(n,10,1))
df <- arrange(df,groupingvar)

If I wanted to regress each of the y1, y2, y3 on the set of x1 through x6 I could use something along the lines of:

y <- as.matrix(select(df,y1:y3))
x <- as.matrix(select(df,x1:x6))
regs <-lm(y~x)
coeffs <- tidy(regs)
coeffs <- arrange(coeffs,response, term)

(by making use of the following line from the lm() help: "If response is a matrix, a linear model is fitted separately by least-squares to each column of the matrix.")

However, if I need to first group by the grouping variable and then apply the lm function then I'm not quite sure how to do it. I have tried the following, but it produces the same set of coefficients for both groups.

regs2 <- df %>% group_by(groupingvar) %>%
  do(fit2 = lm(as.matrix(select(df,y1:y3)) ~ as.matrix(select(df,x1:x6))))
coeffs2 <- tidy(regs2,fit2)
coeffs2 <- arrange(coeffs2,groupingvar, response)

"and then apply the lm function" `->` Have you tried using `lapply()`? — d8aninja, Mar 21 '17 at 20:22
I'm not sure how to properly use it. I've tried to create a formula list with elements "y1~x1+x2+...+x6", "y2~x1+x2+...+x6, "y3~x1+x2+...+x6", and tried to pass this list to lm(), but I think I trip up on the proper syntax. — user1689945, Mar 21 '17 at 21:00
The apply, sapply, lapply, etc family is absolutely paramount to your understanding. Must learn. There are infinite resources that will teach you much better than any answer here. See Hadley's Advanced R (available online) or many examples in the bookdown library — d8aninja, Mar 21 '17 at 21:02
Thank you for the reference, I will study it. Would you mind sharing how you would approach both steps of this example with lapply (i.e. the group_by() step and the step that iterates lm())? — user1689945, Mar 22 '17 at 01:40

MichaelChirico · Accepted Answer · 2017-03-21T21:52:13.893

1

In data.table, you could melt (reshape long -- stack the outcome variables in one column instead of stored in three columns) & lm by both groupingvar and the outcome variable:

library(data.table)
setDT(df)

#alternatively, set id.vars = c('groupingvar', paste0('x', 1:6)), etc.
longDT = melt(df, id.vars = grep('y', names(df), invert = TRUE))

#this helper function basically splits a named vector into
#  its two components
coefsplit = function(reg) {
  beta = coef(reg)
  list(var = names(beta), coef = beta)
}

#I personally wouldn't assign longDT, I'd just chain this onto
#  the output of melt;
longDT[ , coefsplit(lm(value ~ ., data = .SD)), by = .(groupingvar, variable)]
#     groupingvar variable         var          coef
#  1:           a       y1 (Intercept) -3.595564e+03
#  2:           a       y1          x1 -3.796627e+01
#  3:           a       y1          x2 -1.557268e+02
#  4:           a       y1          x3  2.862738e+02
#  5:           a       y1          x4  1.579548e+02
# ...
# 38:           b       y3          x2  2.136253e+01
# 39:           b       y3          x3 -3.810176e+01
# 40:           b       y3          x4  4.187719e+01
# 41:           b       y3          x5 -2.586184e+02
# 42:           b       y3          x6  1.181879e+02
#     groupingvar variable         var          coef

edited Mar 21 '17 at 21:52

answered Mar 21 '17 at 20:35

MichaelChirico

33,841
14
113
198

Thank you. I think I follow your answer. I was able to run it in R and it works. A few notes/questions: (1) I also tried to replicate the results in Excel with the linest() function, and the result match only for the first grouping variable, whereas they are significantly different for the second grouping variable. Is the least squares algorithm of R and Excel different? (2) I agree that I shouldn't assign longDT, as the number of rows will grow significantly, but how would you pipe the output of melt so that each coefficient is also properly tagged with intercept, x1, etc.? – user1689945 Mar 21 '17 at 21:43
The results should be the same (presumably both are using (X'X)^(-1)X'Y). Can you be specific about which group is returning an incorrect estimate, and what the output of R vs. Excel is? – MichaelChirico Mar 21 '17 at 21:46
Let's set the seed for example with set.seed(100). Then for the grouping variable a, the intercept of y1 is 0.2256. This matches excel's output of the linest() function. However, for the grouping variable b the intercept of y1 is -3.3755, versus excel's output of 6.8732. – user1689945 Mar 22 '17 at 01:33
I found the problem: n=15 produces 6 rows for the b grouping variable, which is the same as the number of independent variables. If I use n=20 with set.seed(100), then I get intercepts of 1.05464 and 5.75549 for y1 for groups a and b respectively with both R and Excel. I'll accept your solution then in the thread. Thank you! – user1689945 Mar 22 '17 at 03:00
Ah, right. Well I guess R is dropping variables automatically. Excel may do the same in a different fashion, hence the divergence. – MichaelChirico Mar 22 '17 at 03:10

score 0 · Answer 2 · answered May 24 '17 at 15:17

I also found a way to achieve this using cbind() as follows:

library(tidyverse)
library(broom)
n  <- 20
df4  <- data.frame(groupingvar= sample(1:2, size = n, replace = TRUE),
                   y1 = rnorm(n,10,1), y2=rnorm(n,100,10), y3=rnorm(n,1000,100),
                   x1=  rnorm(n,10,1), x2=rnorm(n,10,1), x3=rnorm(n,10,1),
                   x4=rnorm(n,10,1), x5=rnorm(n,10,1), x6=rnorm(n,10,1))
df4 <- arrange(df4,groupingvar)

regs <- df4 %>% group_by(groupingvar) %>%
  do(fit = lm(cbind(y1,y2,y3) ~ . -groupingvar, data = .))
coeffs <- tidy(regs, fit)

How to first group_by() then iterate lm() through columns?

2 Answers2