0

I originally posted this question on Cross Validated Stackexchange, and got no answer. Therefore I decided to give it a go here. I am trying to figure out how to obtain lsmeans for a piecewise linear mixed-effects model (fitted with the nlme package) with random intercepts and slopes. My data represent math scores from a group of male and female students taking a test every week before and after the introduction of a daily routine of meditation. A minimal reproducible example to create the data frame and fit the model is as follows:

library(nlme)
library("lsmeans")

# Subject's ID
ID <- c(1,1,1,
        2,2,2,
        3,3,3,
        4,4,4,
        5,5,5,
        6,6,6,
        7,7,7,
        8,8,8)
# Time (weeks) before introduction of routine
time1 <- c(-1,0,0,
           -1,0,0,
           -1,0,0,
           -1,0,0,
           -1,0,0,
           -1,0,0,
           -1,0,0,
           -1,0,0)
# Time (weeks) before introduction of routine
time2 <- c(0,0,1,
           0,0,1,
           0,0,1,
           0,0,1,
           0,0,1,
           0,0,1,
           0,0,1,
           0,0,1)

# week test math scores 
mscore <- c(80,92,73,
           75,80,85,
           60,75,70,
           75,80,75,
           78,84,75,
           78,91,95,
           64,72,71,
           84,92,70)

# create dataframe
longdata <-data.frame(ID, time1, time2, mscore)
head(longdata)

# fit model
pwmodel <- lme(mscore ~ time1+time2,
                      random =~ time1+time2|ID,
                      data=longdata,
                      method="ML")

# calculate marginal means:

#with both variables
lsmeans(pwmodel, ~(time1+time2), 
        at=list(time1=c(-1,0,1), time2=c(-1,0,1)) )

# only with one variable
lsmeans(pwmodel, ~time1, 
        at=list(time1=c(-1,0,1) ))

Here time1 and time2 represent the time before and after the start of the daily meditation routine.

The question is: what is the correct way to obtain lsmeans (or emmeans if that is better) from this model at times -1, 0 and 1? Considering the two time variables or only one of them (either time1 or time2)?

The outputs of both approaches are shown below:

> #with both variables
> lsmeans(pwmodel, ~(time1+time2), 
+         at=list(time1=c(-1,0,1), time2=c(-1,0,1)) )
 time1 time2 lsmean   SE df lower.CL upper.CL
    -1    -1   80.8 5.46  7     67.8     93.7
     0    -1   89.8 5.47  7     76.8    102.7
     1    -1   98.8 5.81  7     85.0    112.5
    -1     0   74.2 2.88  7     67.4     81.1
     0     0   83.2 2.77  7     76.7     89.8
     1     0   92.2 3.28  7     84.5    100.0
    -1     1   67.8 3.34  7     59.9     75.6
     0     1   76.8 3.12  7     69.4     84.1
     1     1   85.8 3.47  7     77.5     94.0

Degrees-of-freedom method: containment 
Confidence level used: 0.95 
> # only with one variable
> lsmeans(pwmodel, ~time1, 
+         at=list(time1=c(-1,0,1) ))
 time1 lsmean   SE df lower.CL upper.CL
    -1     71 2.59  7     64.9     77.1
     0     80 2.38  7     74.4     85.6
     1     89 2.89  7     82.2     95.8

Results are averaged over the levels of: time2 
Degrees-of-freedom method: containment 
Confidence level used: 0.95 

They clearly return different results, but shouldn't both ways give the same value?

Leo
  • 1
  • 1
  • Could we please have a [mcve]? Would you be willing to update your question to use `emmeans`, which is newer and better supported? – Ben Bolker Nov 18 '21 at 22:26
  • Thank you for the observation, I followed your suggestion to try to improve the question. I am not very familiar with emmeans, but I am open to start to use it if that helps to clarify the interpretation of the marginal means for this problem. – Leo Nov 19 '21 at 12:28
  • The lsmeans results here make no sense, because we have averaged over time2 values when in fact you can't vary time2 independently of time 1. – Russ Lenth Nov 21 '21 at 01:43

1 Answers1

0

You have a very awkward parameterization, in that there is really only one concept of "time", not two. I suggest defining a dataset with only the real time variable, and fitting a model that accounts for the break point at 0:

library(nlme)
library(emmeans)

dat <- data.frame(
    ID = rep(1:8, each = 3),
    time = rep(-1:1, 8),
    mscore = c(80,92,73,
               75,80,85,
               60,75,70,
               75,80,75,
               78,84,75,
               78,91,95,
               64,72,71,
               84,92,70)
)

mod <- lme(mscore ~ time:(1 + (time > 0)), ~ time|ID, data = dat)

lsmeans(mod, "time", cov.reduce = FALSE)
##  time lsmean   SE df lower.CL upper.CL
##    -1   74.2 3.20  7     66.7     81.8
##     0   83.2 2.68  7     76.9     89.6
##     1   76.8 2.88  7     70.0     83.5
## 
## Degrees-of-freedom method: containment 
## Confidence level used: 0.95

emmip(mod, ~ time, at = list(time = seq(-1, 1, by = .25)))

Created on 2021-11-20 by the reprex package (v2.0.0)

Russ Lenth
  • 5,922
  • 2
  • 13
  • 21
  • thanks for the answer. Indeed it is a strange parametrization, I found it here: dpmartin42.github.io/posts/Piecewise-growth I am worried about the fact that, although the model slopes result the same for both parametrizations, the standard errors are different. Strangely, in the model with the parametrization that you suggested the standard error for both slopes are the same, whereas for the two-variable parametrization the standard error for each slope are different. – Leo Nov 20 '21 at 17:28
  • Well, I cheated and used a simpler random effect in my model. If you add time:(time > 0) to the RE, you'll get the same SEs, I think. – Russ Lenth Nov 20 '21 at 18:28
  • BTW, I do not recommend using ML fits for post hoc analyses, use REML instead to avoid bias. Another difference between our models. – Russ Lenth Nov 20 '21 at 18:33
  • Thanks for the useful answer. Just one question more, if I wanted to add a second further break point (say at t=2), how could I adapt the model to account for all break points? – Leo Nov 21 '21 at 04:22
  • Just add ana other term inside the parentheses in the model formula – Russ Lenth Nov 22 '21 at 18:53
  • Thanks a lot for the reply, I still do not get it though. I am confused now as to how include a second term and whether the symbol should only be `time>0` or `time>=0` (also in the second term `time>2` or `time>=2` ). Does that make a difference when the dataset is highly unbalanced? – Leo Nov 23 '21 at 19:26