
I was wondering if there might be a more concise way to write the random-effects part of the model syntax indicated below?

P.S.: Essentially, I'm trying to get two uncorrelated sets of random slopes for each level of my binary predictor lb_wght (see the correlation matrix of the random effects below).

library(lme4)

math <- read.csv('https://raw.githubusercontent.com/rnorouzian/v/main/mlgrp1.csv')
math <- within(math, lb_wght <- as.factor(lb_wght))

m <- lmer(
    math ~ 0 + lb_wght + lb_wght:grade + 
    (0 + dummy(lb_wght, "0") + dummy(lb_wght, "0"):grade|id) +  # This line and
    (0 + dummy(lb_wght, "1") + dummy(lb_wght, "1"):grade|id) ,  # This line
    data = math)

[screenshot: estimated random-effects correlation matrix]

rnorouzian
  • for those less familiar with lme4 could you give a verbal description of what this random-effects specification is doing? – Ben Bolker Feb 08 '21 at 01:30
  • @rnorouzian As I've commented on your other questions, you really need to explain the structure of your data better. `lb_wght` is already the numeric representation of a dummy-coded binary variable, so all those calls to `dummy` are adding a lot of unnecessary complexity. – Livius Feb 08 '21 at 12:29

1 Answer


I hesitate to answer this question because I suspect the underlying statistical model is problematic, but this is the programming arm of StackExchange, so here's the programming solution:

Important points:

  • A binary categorical variable represented as 0 and 1 is already dummy coded. It's not problematic, and at times useful, to tell R that it represents a categorical contrast via factor(), but R will just turn it back into 0s and 1s before fitting the model.
  • 0 and 1 can also be interpreted as boolean values. We can swap boolean values via negation.
  • dummy(lb_wght, "0") just provides a reversed encoding of the original binary variable, while dummy(lb_wght, "1") is essentially a no-op -- it gives you back the original dummy-coded variable.
  • We can express the reversed encoding of dummy(lb_wght, "0") as boolean negation.
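To make the negation point concrete, here is a minimal base-R sketch with a toy 0/1 vector (the variable names mirror the ones used below; no lme4 needed):

```r
lb_wght <- c(0, 0, 1, 0, 1)       # toy stand-in for math$lb_wght
lb_wght1 <- lb_wght               # what dummy(lb_wght, "1") encodes: unchanged
lb_wght0 <- as.numeric(!lb_wght)  # what dummy(lb_wght, "0") encodes: flipped
lb_wght0                          # -> 1 1 0 1 0
lb_wght0 + lb_wght1               # always 1: each row is in exactly one group
```

Because the two indicators partition the rows, each random-effects term applies to exactly one group.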

data wrangling warm up

> library("lme4")
> 
> math <- read.csv('https://raw.githubusercontent.com/rnorouzian/v/main/mlgrp1.csv')
> # A two-level contrast represented as dummy coding means we have 0 and 1
> # This is the same as FALSE and TRUE and so we can swap them using boolean negation
> math <- within(math, {
+                         lb_wght0 <- as.numeric(!lb_wght) 
+                         lb_wght1 <- lb_wght
+                      })
> head(math) # check what those new vars look like
    id female lb_wght anti_k1 math grade occ age men spring anti nb_wght lb_wght1 lb_wght0
1  201      1       0       0   38     1   2 111   0      1    0       1        0        1
2  201      1       0       0   55     3   3 135   1      1    0       1        0        1
3  303      1       0       1   26     0   2 121   0      1    2       1        0        1
4  303      1       0       1   33     3   3 145   0      1    2       1        0        1
5 2702      0       0       0   56     0   2 100   .      1    0       1        0        1
6 2702      0       0       0   58     2   3 125   .      1    2       1        0        1

no correlation in the RE

> m <- lmer(
+   math ~ 0 + lb_wght0 + lb_wght0:grade  + lb_wght1 +  lb_wght1:grade + 
+     (0 + lb_wght0 + lb_wght0:grade  + lb_wght1 +  lb_wght1:grade || id),   # double bar: no RE correlations
+   data = math)
> summary(m)
Linear mixed model fit by REML ['lmerMod']
Formula: math ~ 0 + lb_wght0 + lb_wght0:grade + lb_wght1 + lb_wght1:grade +  
    ((0 + lb_wght0 | id) + (0 + lb_wght1 | id) + (0 + lb_wght0:grade |          id) + (0 + grade:lb_wght1 | id))
   Data: math

REML criterion at convergence: 15932

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-3.06601 -0.52971 -0.00312  0.53661  2.54121 

Random effects:
 Groups   Name           Variance Std.Dev.
 id       lb_wght0       61.9312  7.8696  
 id.1     lb_wght1       84.7006  9.2033  
 id.2     lb_wght0:grade  0.7044  0.8393  
 id.3     grade:lb_wght1  0.6598  0.8123  
 Residual                36.3585  6.0298  
Number of obs: 2221, groups:  id, 932

Fixed effects:
               Estimate Std. Error t value
lb_wght0       35.48689    0.36619   96.91
lb_wght1       32.81813    1.35307   24.25
lb_wght0:grade  4.29283    0.09059   47.39
grade:lb_wght1  4.86449    0.32029   15.19

Correlation of Fixed Effects:
            lb_wg0 lb_wg1 lb_w0:
lb_wght1     0.000              
lb_wght0:gr -0.533  0.000       
grd:lb_wgh1  0.000 -0.489  0.000

matching the correlation structure in the OP

> m <- lmer(
+   math ~ 0 + lb_wght0 + lb_wght0:grade  + lb_wght1 +  lb_wght1:grade + 
+     (0 + lb_wght0 + lb_wght0:grade | id) + 
+     (0 + lb_wght1 +  lb_wght1:grade | id),
+   data = math)
> summary(m)
Linear mixed model fit by REML ['lmerMod']
Formula: math ~ 0 + lb_wght0 + lb_wght0:grade + lb_wght1 + lb_wght1:grade +  
    (0 + lb_wght0 + lb_wght0:grade | id) + (0 + lb_wght1 + lb_wght1:grade |      id)
   Data: math

REML criterion at convergence: 15931.1

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-3.09189 -0.52867 -0.00063  0.53419  2.54169 

Random effects:
 Groups   Name           Variance Std.Dev. Corr 
 id       lb_wght0       61.1969  7.8228        
          lb_wght0:grade  0.6711  0.8192   0.04 
 id.1     lb_wght1       97.4494  9.8716        
          lb_wght1:grade  1.4314  1.1964   -0.35
 Residual                36.2755  6.0229        
Number of obs: 2221, groups:  id, 932

Fixed effects:
               Estimate Std. Error t value
lb_wght0        35.4848     0.3646   97.31
lb_wght1        32.8506     1.4214   23.11
lb_wght0:grade   4.2930     0.0902   47.60
grade:lb_wght1   4.8827     0.3414   14.30

Correlation of Fixed Effects:
            lb_wg0 lb_wg1 lb_w0:
lb_wght1     0.000              
lb_wght0:gr -0.527  0.000       
grd:lb_wgh1  0.000 -0.567  0.000

for comparison, the OP's model:

> math <- within(math, lb_wght <- as.factor(lb_wght))
> m <- lmer(
+   math ~ 0 + lb_wght + lb_wght:grade + 
+     (0 + dummy(lb_wght, "0") + dummy(lb_wght, "0"):grade|id) +  # This line and
+     (0 + dummy(lb_wght, "1") + dummy(lb_wght, "1"):grade|id) ,  # This line
+   data = math)
> summary(m)
Linear mixed model fit by REML ['lmerMod']
Formula: math ~ 0 + lb_wght + lb_wght:grade + (0 + dummy(lb_wght, "0") +  
    dummy(lb_wght, "0"):grade | id) + (0 + dummy(lb_wght, "1") +      dummy(lb_wght, "1"):grade | id)
   Data: math

REML criterion at convergence: 15931.1

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-3.09189 -0.52867 -0.00063  0.53419  2.54169 

Random effects:
 Groups   Name                      Variance Std.Dev. Corr 
 id       dummy(lb_wght, "0")       61.1969  7.8228        
          dummy(lb_wght, "0"):grade  0.6711  0.8192   0.04 
 id.1     dummy(lb_wght, "1")       97.4494  9.8716        
          dummy(lb_wght, "1"):grade  1.4314  1.1964   -0.35
 Residual                           36.2755  6.0229        
Number of obs: 2221, groups:  id, 932

Fixed effects:
               Estimate Std. Error t value
lb_wght0        35.4848     0.3646   97.31
lb_wght1        32.8506     1.4214   23.11
lb_wght0:grade   4.2930     0.0902   47.60
lb_wght1:grade   4.8827     0.3414   14.30

Correlation of Fixed Effects:
            lb_wg0 lb_wg1 lb_w0:
lb_wght1     0.000              
lb_wght0:gr -0.527  0.000       
lb_wght1:gr  0.000 -0.567  0.000

edit: 2021-02-08 20:57 UTC

NB This is not the question as asked, but rather what was demanded in a post-hoc comment. Nonetheless, the programmatic generation of indicator variables and a formula using base R is perhaps interesting. This could be rewritten with `apply`-family functions, but that seems like premature optimization for the problem at hand. I don't think it's a good idea to generalize this approach to more than two levels without really thinking about what that model would even mean, but again, this is an answer to the programming question, not an endorsement of the statistical procedure it implements.
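For what it's worth, a hedged sketch of the `sapply` variant, showing only the formula-string construction (toy group levels standing in for unique(math$lb_wght); no model is fit):

```r
grps <- c(0, 1)  # toy stand-in for unique(math$lb_wght)

# build one fixed-effects + random-effects chunk per group level
terms <- sapply(grps, function(g) {
  gname <- paste0("lb_wght", g)
  fe <- paste("0", gname, paste0(gname, ":grade"), sep = " + ")
  paste(fe, paste0("(", fe, " | id)"), sep = " + ")
})

form <- paste("math ~ 0", paste(terms, collapse = " + "), sep = " + ")
form
```

As with the for-loop below, each group's chunk carries a redundant leading 0 +, which as.formula() tolerates, so the resulting formula matches the one shown in the fitted model's output.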

handle an arbitrary number of levels in the experimental variable

This will automatically generate all the indicator variables and the formula.

> # GOOD LUCK WITH THIS
> library("lme4")
> 
> math <- read.csv('https://raw.githubusercontent.com/rnorouzian/v/main/mlgrp1.csv')
> grps <- unique(math$lb_wght)
> 
> form <- "math ~ 0"
> 
> for(g in grps){
+   gname <- paste0("lb_wght",g)
+   math[, gname] <- ifelse(math$lb_wght == g, 1, 0) 
+   term <- paste("0", gname, paste0(gname,":grade"), sep="+")
+   re <- paste0("(", term, "|id)")
+   form <- paste(form, term, re, sep="+")
+ }
> 
> form <- as.formula(form)
> 
> m <- lmer(form, data = math)
> summary(m)
Linear mixed model fit by REML ['lmerMod']
Formula: math ~ 0 + 0 + lb_wght0 + lb_wght0:grade + (0 + lb_wght0 + lb_wght0:grade |  
    id) + 0 + lb_wght1 + lb_wght1:grade + (0 + lb_wght1 + lb_wght1:grade |      id)
   Data: math

REML criterion at convergence: 15931.1

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-3.09189 -0.52867 -0.00063  0.53419  2.54169 

Random effects:
 Groups   Name           Variance Std.Dev. Corr 
 id       lb_wght0       61.1969  7.8228        
          lb_wght0:grade  0.6711  0.8192   0.04 
 id.1     lb_wght1       97.4494  9.8716        
          lb_wght1:grade  1.4314  1.1964   -0.35
 Residual                36.2755  6.0229        
Number of obs: 2221, groups:  id, 932

Fixed effects:
               Estimate Std. Error t value
lb_wght0        35.4848     0.3646   97.31
lb_wght1        32.8506     1.4214   23.11
lb_wght0:grade   4.2930     0.0902   47.60
grade:lb_wght1   4.8827     0.3414   14.30

Correlation of Fixed Effects:
            lb_wg0 lb_wg1 lb_w0:
lb_wght1     0.000              
lb_wght0:gr -0.527  0.000       
grd:lb_wgh1  0.000 -0.567  0.000
Livius
  • Dear Livius, many thanks for taking the time to write this answer. I should say I already knew this answer, but was wondering if there might be a different coding alternative to either using `dummy()` or its exact mirror (what you kindly proposed). Ideas for such alternatives are, for example, reflected [**HERE**](https://rpsychologist.com/r-guide-longitudinal-lme-lmer#different-level-3-variance-covariance-matrix). In any case, in the above case, I'm trying to fit what's called a *multiple-group analysis* . . . – rnorouzian Feb 08 '21 at 15:55
  • . . ., In the current model, 2 different models are being estimated using the same call to `lmer()`. That is, a model for each level of `lb_wght`. A full multiple-group analysis would be possible if we allowed the variances at each level of `lb_wght` to differ as well, which is what we discussed [**HERE**](https://stats.stackexchange.com/a/508220/140365). – rnorouzian Feb 08 '21 at 15:57
  • @rnorouzian Once again, you've got an [XY Problem](https://meta.stackexchange.com/questions/66377/what-is-the-xy-problem). Please ask questions about your actual problem and not about implementation details unless you want an answer to your implementation details. – Livius Feb 08 '21 at 16:44
  • Note also that "2 different models are being estimated...." is not true: a single model is being estimated (with corresponding partial pooling), but there is a particular blocking structure to the random-effects covariance matrix. I also have my doubts that the generalized variant of this covariance structure is particularly useful -- why are you assuming that structure? What does that structure mean in real-world terms for your data when you have more than 2 groups? What is the inferential goal? What do your data represent? – Livius Feb 08 '21 at 16:47
  • Dear Livius, please refer to Chapter 6 of [This Book](https://www.guilford.com/books/Growth-Modeling/Grimm-Ram-Estabrook/9781462526062). If you can't access a copy of the book, I would be happy to send you one (via email). I'm studying the book right now for personal interest. – rnorouzian Feb 08 '21 at 16:51
  • @rnorouzian In the question, the problem is explicitly stated as "binary predictor". That question has been answered. If you have a different question, then you should open a new question. Also, it is incumbent upon the person asking the question to put enough information in the question so that people can understand the question -- the person answering shouldn't have to look for information elsewhere. Finally, as it is now explicitly clear that this is `self-study`, questions on CrossValidated should include the `self-study` tag. – Livius Feb 08 '21 at 17:09
  • Dear Livius, I don't have a statistical question. I simply wanted to know if there is a smarter way to define the random part of the syntax, for example using one categorical variable (`lb_wght`) in the model. A purely R-syntax question. However, since you "suspect the underlying statistical model is problematic", I prepared the chapter of the book in which this example is very clearly explained for your consideration [**HERE**](https://github.com/hkil/m/blob/master/G.pdf). You can just read pp. 115-123. Once again, many thanks for your insights. – rnorouzian Feb 08 '21 at 19:09