How to decide when and how to include covariates in a linear mixed-effects model in lme4

Question

I am running a linear mixed-effects model in R, and I'm not sure how to include a covariate of no interest in the model, or even how to decide if I should do that.

I have two within-subject variables, let's call them A and B with two levels each, with lots of observations per participant. I'm interested in how their interaction changes across 4 groups. My outcome is reaction time. At the simplest level, I have this model:

RT ~ 1 + A*B*Groups + (1+A | Subject ID)

I would like to add Gender as a covariate of no interest. I have no theoretical reason to assume it affects anything, but it's really imbalanced across groups, so I'd like to include it. The first part of my question is: What is the best way to do this?

Is it this model:

RT ~ 1 + A*B*Groups + Gender + (1+A | Subject ID)

or this:

RT ~ 1 + A*B*Groups*Gender + (1+A | Subject ID)

? Or some other way? My worry about the second model is that it somewhat unreasonably inflates the number of terms in the model. I'm also worried about overfitting.

The second part of my question: When selecting the best model, when should I add the covariate to see if it makes any difference at all? Let me explain what I mean.

Let's say I start with the simplest model I mentioned above, but without the slope for A, so this:

RT ~ 1 + A*B*Groups + (1| Subject ID)

Should I add the covariate first, either as a main effect ( + Gender) or as part of the interaction (*Gender), and then see if adding a slope for A makes a difference (by using the anova() function), or can I go ahead with adding the slope (which is theoretically more important) first, and then see if gender matters at all?

score 1 · Answer 1 · answered Apr 20 '18 at 11:05

Here are some suggestions regarding your two questions.

I would recommend an iterative modelling strategy.

Start with
```
RT ~ 1 + A*B*Groups*Gender + (1+A | Subject ID)
```
and see if the problem is tractable. This model includes the additive effects as well as all interaction terms between A, B, Groups and Gender.

If the problem is not tractable, discard the interaction terms between Gender and the other predictors, and fit
```
RT ~ 1 + A*B*Groups + Gender + (1+A | Subject ID)
```
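As a rough illustration, the two candidate models could be fitted and compared in `lme4` along the following lines. This is only a sketch on simulated data: the data frame `d`, the effect sizes, and the column name `SubjectID` (written without a space, since R names cannot contain one unless backtick-quoted) are all assumptions made for the example, not your data.

```r
library(lme4)

set.seed(1)

# Hypothetical subject-level table: 40 subjects, 4 groups, unequal gender split.
# Every Groups x Gender cell is populated, so all interaction terms are estimable.
n_subj <- 40
subj <- data.frame(
  SubjectID = factor(seq_len(n_subj)),
  Groups    = factor(rep(paste0("g", 1:4), length.out = n_subj)),
  Gender    = factor(rep(c("f", "f", "m"), each = 4, length.out = n_subj)),
  u0        = rnorm(n_subj, sd = 50),  # random intercepts (assumed scale, ms)
  u1        = rnorm(n_subj, sd = 40)   # random slopes for A (assumed scale, ms)
)

# Cross every subject with the 2 x 2 within-subject design, 10 trials per cell.
design <- expand.grid(
  SubjectID = subj$SubjectID,
  A         = factor(c("a1", "a2")),
  B         = factor(c("b1", "b2")),
  trial     = seq_len(10)
)
d <- merge(subj, design)

# Simulated reaction times; the fixed effects here are made up.
d$RT <- 500 + d$u0 + (30 + d$u1) * (d$A == "a2") +
  20 * (d$B == "b2") + rnorm(nrow(d), sd = 80)

# Fit with ML (REML = FALSE) because the models differ in their fixed effects.
m_add  <- lmer(RT ~ 1 + A * B * Groups + Gender + (1 + A | SubjectID),
               data = d, REML = FALSE)
m_full <- lmer(RT ~ 1 + A * B * Groups * Gender + (1 + A | SubjectID),
               data = d, REML = FALSE)

# Number of fixed-effect coefficients: 17 for the additive model, 32 for the full one.
length(fixef(m_add))
length(fixef(m_full))

# Likelihood-ratio test of the 15 extra Gender interaction terms.
anova(m_add, m_full)
```

The coefficient counts make the concern about term inflation concrete: crossing Gender with everything doubles the fixed-effect part of the model, whereas the additive version adds a single coefficient.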
It's difficult to make a statement about potential overfitting without any details on the number of observations.

Concerning your second question: Generally, I would recommend a Bayesian approach. Take a look at the rstan-based brms R package, which lets you use the same lme4-style formula syntax, making it easy to translate models. Model comparison and predictive performance are very broad terms, and there are various ways to explore and compare the predictive performance of these types of nested/hierarchical Bayesian models. See, for example, the papers by Piironen and Vehtari and by Vehtari and Ojanen.
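To give an idea of what the translation to `brms` might look like, here is a minimal sketch on simulated data. The data frame `d`, the `normal(0, 100)` prior on the regression coefficients, and the short sampler settings are assumptions chosen only to keep the example small. Approximate leave-one-out cross-validation via `loo()` is one possible way to compare predictive performance, not the only one.

```r
library(brms)

set.seed(2)

# Small hypothetical data set: 24 subjects, 4 groups, 2 x 2 within-subject design.
n_subj <- 24
subj <- data.frame(
  SubjectID = factor(seq_len(n_subj)),
  Groups    = factor(rep(paste0("g", 1:4), length.out = n_subj)),
  Gender    = factor(rep(c("f", "f", "m"), each = 4, length.out = n_subj)),
  u0        = rnorm(n_subj, sd = 50)  # random intercepts (assumed scale, ms)
)
design <- expand.grid(
  SubjectID = subj$SubjectID,
  A         = factor(c("a1", "a2")),
  B         = factor(c("b1", "b2")),
  trial     = seq_len(5)
)
d <- merge(subj, design)
d$RT <- 500 + d$u0 + 30 * (d$A == "a2") + 20 * (d$B == "b2") +
  rnorm(nrow(d), sd = 80)

# Weakly informative (regularising) prior on all regression coefficients.
# The scale of 100 ms is an assumption for this simulated example.
priors <- set_prior("normal(0, 100)", class = "b")

# Same formulas as in lme4; few chains and iterations just to keep the sketch quick.
fit_add <- brm(RT ~ 1 + A * B * Groups + Gender + (1 + A | SubjectID),
               data = d, family = gaussian(), prior = priors,
               chains = 2, iter = 1000, seed = 2, refresh = 0)
fit_full <- brm(RT ~ 1 + A * B * Groups * Gender + (1 + A | SubjectID),
                data = d, family = gaussian(), prior = priors,
                chains = 2, iter = 1000, seed = 2, refresh = 0)

# Compare estimated out-of-sample predictive performance (PSIS-LOO).
loo_add  <- loo(fit_add)
loo_full <- loo(fit_full)
loo_compare(loo_add, loo_full)
```

For a real analysis you would want more chains and iterations, and priors chosen on the scale of your own reaction times.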

Thanks for your response! By tractable, do you mean whether the model converges or not? I have 25-30 subjects in each group (108 altogether) with around 280 observations per subject (roughly equal numbers, so about 70, in each cell of A x B). — MGy, Apr 20 '18 at 12:19
@MGy Tractable as in robust & reproducible estimates can be made in polynomial time. Based on the number of observations per group and per subject, I would say that overfitting & non-tractability shouldn't be too much of an issue even with the more complex model. But then I'm not that familiar with `lme4` because I prefer using `rstan` to fit nested models where I can give weakly informative (regularising) priors to help with robustness of parameter estimates. — Maurits Evers, Apr 21 '18 at 01:28
