
To do a linear regression in Julia, we can use the lm function from the GLM package like this:

using DataFrames
using GLM

df = DataFrame(x=[2,3,2,1,3,5,7,4,2],
               y=[2,3,5,1,5,6,4,2,3])
9×2 DataFrame
 Row │ x      y     
     │ Int64  Int64 
─────┼──────────────
   1 │     2      2
   2 │     3      3
   3 │     2      5
   4 │     1      1
   5 │     3      5
   6 │     5      6
   7 │     7      4
   8 │     4      2
   9 │     2      3

lm(@formula(y~x), df)
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

y ~ 1 + x

Coefficients:
───────────────────────────────────────────────────────────────────────
                Coef.  Std. Error     t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  2.14516     1.11203   1.93    0.0951  -0.484383    4.77471
x            0.403226    0.303282  1.33    0.2254  -0.313923    1.12037
───────────────────────────────────────────────────────────────────────
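Individual statistics can be read off the fitted model; for example (a short sketch using the r2 and coeftable helpers that GLM re-exports):

m = lm(@formula(y ~ x), df)
r2(m)          # R² (coefficient of determination)
coeftable(m)   # std. errors, t statistics, p-values, confidence intervals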

But I was wondering how to do a linear regression per group in Julia. Here is some reproducible data:

df = DataFrame(group = ['A', 'A', 'A', 'A', 'A', 'A', 'A', 'A', 'A',
                        'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B'],
               x=[2,3,2,1,3,5,7,4,2,3,4,5,2,6,3,1,6,1],
               y=[2,3,5,1,5,6,4,2,3,3,2,4,7,1,8,4,3,1])
18×3 DataFrame
 Row │ group  x      y     
     │ Char   Int64  Int64 
─────┼─────────────────────
   1 │ A          2      2
   2 │ A          3      3
   3 │ A          2      5
   4 │ A          1      1
   5 │ A          3      5
   6 │ A          5      6
   7 │ A          7      4
   8 │ A          4      2
  ⋮  │   ⋮      ⋮      ⋮
  12 │ B          5      4
  13 │ B          2      7
  14 │ B          6      1
  15 │ B          3      8
  16 │ B          1      4
  17 │ B          6      3
  18 │ B          1      1
             3 rows omitted

So I was wondering if anyone knows how to perform a linear regression per group (in this case for groups A and B in df) and get statistics like the p-value and R² per group in Julia?

Quinten

2 Answers


UPDATE: In light of @matnor's comment about the difficulty of interpreting the results of a regression with a baseline group, here is a better formula which gives clearer grouped results:

lm(@formula(y~0 + group + x & group), df)

With this regression the coefficient table is mostly self-explanatory: each group gets its own intercept and slope row. Note that the covariance matrix still needs interpretation (though depending on context this model may be more applicable).
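As a minimal sketch of what this looks like (assuming the grouped df from the question; exact coefficient labels may vary between GLM versions):

m = lm(@formula(y ~ 0 + group + x & group), df)
coeftable(m)   # "group: A"/"group: B" rows are the per-group intercepts,
               # "x & group: A"/"x & group: B" rows are the per-group slopes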

ORIGINAL ANSWER: It can be as simple as:

lm(@formula(y~1 + group + x*group), df)

GLM fits group as a categorical variable, adding a dummy coefficient for each group. The interaction term x*group adds another set of dummy coefficients. The first set represents the intercept of each group, and the second represents the slope. Here are the results:

StatsModels.TableRegressionModel{...}

y ~ 1 + x + group + x & group

Coefficients:
──────────────────────────────────────────────────────
                  Coef.  Std. Error      t  Pr(>|t|)
──────────────────────────────────────────────────────
(Intercept)    2.14516     1.47762    1.45    0.1686
x              0.403226    0.402988   1.00    0.3340
group: B       2.62322     2.10649    1.25    0.2335
x & group: B  -0.723079    0.557198  -1.30    0.2154
──────────────────────────────────────────────────────

Note that group A doesn't appear because it is the baseline group (and corresponds to the first intercept/slope pair).

If you look at the numbers for group B, for example, you can see 2.62322 and -0.723079, which need to be added to the baseline to get the intercept and slope of the group:

julia> # 4.76838, -0.319853 are group B slope/intercept

julia> 2.14516 + 2.62322 ≈ 4.76838 # intercept
true

julia> 0.403226 + -0.723079 ≈ -0.319853 # slope
true

This method has some efficiency benefits over fitting each group separately, as well as added flexibility (GLM has more features).

Dan Getz
  • Note the construction of GLM formulas is a mini-language. See docs for more info. `y ~ 1 + x * group` would have been enough, the way `*` is defined. – Dan Getz Dec 22 '22 at 11:30
  • Another thing to note is that while the coefficient estimates are identical to separately estimating regressions for each group, standard errors will not be identical because the calculation of the covariance matrix is not identical. – matnor Dec 22 '22 at 12:11
  • @matnor: The covariance matrix for the group-factor model has more values (2 × (2×2) vs. 4×4), so the comparison between models is non-trivial (just as the simple coefficients needed to be combined with the baseline). – Dan Getz Dec 22 '22 at 12:23
  • @matnor: One final comment, after updating the answer: the new covariance matrix has 8 non-zero elements. The relevant subblocks are scaled versions of the smaller regressions (reflecting the larger number of samples, something which needs context to decide what the true sample size is). – Dan Getz Dec 22 '22 at 13:15
  • Yes, it is not that one or the other is wrong, it depends on the model assumption. The covariance matrix in the fully interacted model assumes homoskedasticity (equal variance of the error term for the different groups). This may or may not be a reasonable assumption. – matnor Dec 23 '22 at 13:07

Use the groupby function from DataFrames:

julia> gp = groupby(df, :group)
GroupedDataFrame with 2 groups based on key: group
First Group (9 rows): group = 'A': ASCII/Unicode U+0041 (category Lu: Letter, uppercase)
 Row │ group  x      y
     │ Char   Int64  Int64
─────┼─────────────────────
   1 │ A          2      2
   2 │ A          3      3
   3 │ A          2      5
   4 │ A          1      1
   5 │ A          3      5
   6 │ A          5      6
   7 │ A          7      4
   8 │ A          4      2
   9 │ A          2      3
⋮
Last Group (9 rows): group = 'B': ASCII/Unicode U+0042 (category Lu: Letter, uppercase)
 Row │ group  x      y
     │ Char   Int64  Int64
─────┼─────────────────────
   1 │ B          3      3
   2 │ B          4      2
   3 │ B          5      4
  ⋮  │   ⋮      ⋮      ⋮
   7 │ B          1      4
   8 │ B          6      3
   9 │ B          1      1
             3 rows omitted


julia> for df in gp
           @show lm(@formula(y~x), df)
       end
lm(#= REPL[12]:2 =# @formula(y ~ x), df) = StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

y ~ 1 + x

Coefficients:
───────────────────────────────────────────────────────────────────────
                Coef.  Std. Error     t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  2.14516     1.11203   1.93    0.0951  -0.484383    4.77471
x            0.403226    0.303282  1.33    0.2254  -0.313923    1.12037
───────────────────────────────────────────────────────────────────────
lm(#= REPL[12]:2 =# @formula(y ~ x), df) = StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

y ~ 1 + x

Coefficients:
─────────────────────────────────────────────────────────────────────────
                 Coef.  Std. Error      t  Pr(>|t|)  Lower 95%  Upper 95%
─────────────────────────────────────────────────────────────────────────
(Intercept)   4.76838     1.79758    2.65    0.0328   0.517773   9.01899
x            -0.319853    0.460734  -0.69    0.5099  -1.40932    0.769609
─────────────────────────────────────────────────────────────────────────

And if you want to save the objects returned by lm, you can take the following approach:

julia> res = Vector{StatsModels.TableRegressionModel}(undef, 2);

julia> for (idx,df) in enumerate(gp)
           res[idx] = lm(@formula(y~x), df)
       end

julia> res[1]
StatsModels.TableRegressionModel{LinearModel{GLM.LmResp{Vector{Float64}}, GLM.DensePredChol{Float64, LinearAlgebra.CholeskyPivoted{Float64, Matrix{Float64}, Vector{Int64}}}}, Matrix{Float64}}

y ~ 1 + x

Coefficients:
───────────────────────────────────────────────────────────────────────
                Coef.  Std. Error     t  Pr(>|t|)  Lower 95%  Upper 95%
───────────────────────────────────────────────────────────────────────
(Intercept)  2.14516     1.11203   1.93    0.0951  -0.484383    4.77471
x            0.403226    0.303282  1.33    0.2254  -0.313923    1.12037
───────────────────────────────────────────────────────────────────────
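From the saved models you can then extract the statistics asked about in the question. A minimal sketch (r2 and coeftable are re-exported by GLM; cols and pvalcol are fields of StatsBase's CoefTable):

for (i, m) in enumerate(res)
    ct = coeftable(m)
    pval_x = ct.cols[ct.pvalcol][2]  # p-value of the x coefficient
    println("group ", i, ": R² = ", r2(m), ", p(x) = ", pval_x)
end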
Shayan