I have a data set with columns Y, X1, X2 and V. While Y, X1 and X2 are continuous, V is a categorical variable. Assuming V has 10 categories, I want to create 10 linear regression models and store the results (coefficients, p-values, R-Sq, etc) in another table. Is there a way to do it with data.table without using for loops? Thanks.

mlg
  • See `lmList` in the nlme package. See [mcve] for information on how to provide a reproducible example when asking a question on SO. – G. Grothendieck Sep 14 '16 at 19:53
  • Thanks. I tried lmList, it worked. I got only the coeffs, but I am sure I can figure out how to get R-Sq, p-values etc. – mlg Sep 14 '16 at 20:12
  • See http://stackoverflow.com/questions/23501852/print-r-squared-for-all-of-the-models-fit-with-lmlist – G. Grothendieck Sep 14 '16 at 22:52
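As a rough illustration of the `lmList` suggestion in the comments, the sketch below fits one model per level of V and pulls out R-squared for each fit. The iris-based `dataSet` and the formula `Y ~ X1 + X2` are stand-in assumptions, not the asker's actual data.

```r
library(nlme)  # recommended package that ships with most R installations

# Stand-in data (assumption): iris columns renamed to match the question
dataSet <- data.frame(Y  = iris$Sepal.Length,
                      X1 = iris$Sepal.Width,
                      X2 = iris$Petal.Length,
                      V  = iris$Species)

# Fit Y ~ X1 + X2 separately within each level of V
fits <- lmList(Y ~ X1 + X2 | V, data = dataSet)

# Coefficients: one row per level of V
coefs <- coef(fits)

# Each element of `fits` is an ordinary lm object, so summary() applies
r2 <- sapply(fits, function(m) summary(m)$r.squared)
```

Because each element is a plain `lm` fit, p-values should be retrievable the same way from `summary(m)$coefficients`.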

2 Answers

The base R function `by` is what you want.

# make up some sample data
dataSet <- data.frame(Y = iris$Sepal.Length, 
                      X1 = iris$Sepal.Width, 
                      X2 = iris$Petal.Length, 
                      V = iris$Species)
# apply the `lm` function by the value of `V`
by(data = dataSet[c("Y","X1","X2")], 
   INDICES = dataSet$V, 
   FUN = lm, 
   formula = Y ~ .)

In the `by` function, `data` is the data you want to apply the function to. `INDICES` is a factor, or a list of factors, with one value for each row of `data`, indicating how the data should be split. `FUN` is the function applied to each subset. In this case, `lm()` needs the extra argument `formula` to specify the model, so you pass it as an additional named argument to `by`, which forwards it to `FUN`.
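To get from the fitted models to a results table, one option is to loop over the object returned by `by` with `sapply`. This is only a sketch: the name `results` and the choice of columns are illustrative assumptions.

```r
# Same made-up sample data as above
dataSet <- data.frame(Y  = iris$Sepal.Length,
                      X1 = iris$Sepal.Width,
                      X2 = iris$Petal.Length,
                      V  = iris$Species)

fits <- by(data = dataSet[c("Y", "X1", "X2")],
           INDICES = dataSet$V,
           FUN = lm,
           formula = Y ~ .)

# Group labels are stored in the dimnames of the `by` object
groups <- dimnames(fits)[[1]]

# One row per group: coefficients plus R-squared
results <- data.frame(
  V = groups,
  t(sapply(fits, coef)),                                   # coefficient columns
  r.squared = sapply(fits, function(m) summary(m)$r.squared),
  check.names = FALSE,                                     # keep "(Intercept)" as is
  row.names = NULL
)
```

P-values can be added in the same way from `summary(m)$coefficients[, "Pr(>|t|)"]`.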

Barker

The broom package exists exactly for this type of problem. It 'tidies' the output of models into neat data frames for easy storage and comparison. Here is an example that uses broom and dplyr to solve a nearly identical problem. It uses dplyr to group the data by a categorical variable, fits a model to each group, and extracts the coefficients into a data.frame in just a few lines of code. I am unfamiliar with data.table's grouped operations, but it may be possible to do something similar with that package.

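Additionally, broom has the glance function, which returns goodness-of-fit metrics and other model-level summary statistics such as R-squared, and the augment function, which adds per-observation values such as fitted values and residuals.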
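A minimal sketch of the grouped broom approach might look like the following. It assumes dplyr (0.8.1 or later, for `group_modify`) and broom are installed, and it uses iris-based stand-in data with the formula `Y ~ X1 + X2`. The names `coef_table` and `fit_table` are made up for illustration.

```r
library(dplyr)
library(broom)

# Stand-in data (assumption): iris columns renamed to match the question
dataSet <- data.frame(Y  = iris$Sepal.Length,
                      X1 = iris$Sepal.Width,
                      X2 = iris$Petal.Length,
                      V  = iris$Species)

# One row per coefficient per group: estimate, std.error, statistic, p.value
coef_table <- dataSet %>%
  group_by(V) %>%
  group_modify(~ tidy(lm(Y ~ X1 + X2, data = .x))) %>%
  ungroup()

# One row per group: r.squared, adj.r.squared, AIC, and similar model-level values
fit_table <- dataSet %>%
  group_by(V) %>%
  group_modify(~ glance(lm(Y ~ X1 + X2, data = .x))) %>%
  ungroup()
```

Keeping the coefficient-level and model-level tables separate avoids repeating R-squared on every coefficient row; they can be joined on V if a single table is preferred.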

Alternatively, if you want to do it without installing additional packages, you could split your data frame into a list (using the split function), lapply the modeling process over the list, extract the results (probably through another lapply that pulls the relevant pieces from each lm object), and then rbind it all together.
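The split / lapply / rbind route could be sketched roughly as below, using only base R. The sample data, the formula `Y ~ X1 + X2`, and the output column names are assumptions chosen for illustration.

```r
# Stand-in data (assumption): iris columns renamed to match the question
dataSet <- data.frame(Y  = iris$Sepal.Length,
                      X1 = iris$Sepal.Width,
                      X2 = iris$Petal.Length,
                      V  = iris$Species)

# 1. Split the data frame into a list, one element per level of V
pieces <- split(dataSet, dataSet$V)

# 2. Fit the same model to each piece
fits <- lapply(pieces, function(d) lm(Y ~ X1 + X2, data = d))

# 3. Extract coefficients, p-values and R-squared from each fit
rows <- lapply(names(fits), function(g) {
  s  <- summary(fits[[g]])
  cf <- as.data.frame(s$coefficients)
  names(cf) <- c("estimate", "std.error", "t.value", "p.value")
  data.frame(V = g, term = rownames(cf), cf,
             r.squared = s$r.squared, row.names = NULL)
})

# 4. Bind everything into one table: one row per group and coefficient
results <- do.call(rbind, rows)
```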

Mir Henglin