
I used lmer from the lme4 package to run a linear mixed effects model. I have 3 years of temperature data from 5 untreated (control) plots and 10 treated plots. The model:

modela <- lmer(ave ~ yr * tr + (1 | pl), REML = FALSE, data = mydata)

I checked the model for normality of residuals with a qqnorm plot.
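For reference, a minimal sketch of that check (assuming modela as fitted above):

qqnorm(resid(modela))  # quantile-quantile plot of model residuals
qqline(resid(modela))  # reference line; points should follow it under normality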

My data (excerpt of str output):

'data.frame':   6966 obs. of  7 variables:
 $ yr : Factor w/ 3 levels "yr1","yr2","yr3": 1 1 1 1 1 1 1 1 1 1 ...
 $ pl : Factor w/ 15 levels "C02","C03","C05",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ tr : Factor w/ 2 levels "Cont","OTC": 1 1 1 1 1 1 1 1 1 1 ...
 $ ave: num  14.8 16.1 11.6 10.3 11.6 ...

The interaction is significant, so I used lsmeans:

lsmeans(modela, pairwise~yr*tr, adjust="tukey")

In the contrasts, I get (excerpts only):

contrast                estimate        SE      df t.ratio p.value
yr1,Cont - yr2,Cont -0.727102895 0.2731808 6947.24  -2.662  0.0832
yr1,OTC - yr2,OTC   -0.990574030 0.2015650 6449.10  -4.914  <.0001
yr1,Cont - yr1,OTC  -0.005312771 0.3889335   31.89  -0.014  1.0000
yr2,Cont - yr2,OTC  -0.268783907 0.3929332   32.97  -0.684  0.9825

My question concerns the high d.f. for some of the contrasts, and the associated (but, I think, meaningless) low p-values.

Can this be due to:

- the presence of NAs in my data set (there was some improvement when I removed them)

- unequal sample sizes (e.g. 5 plots in one treatment, 10 in the other); however, the within-year treatment contrasts (e.g. yr1,Cont - yr1,OTC) don't seem to be a problem.

Other issues?

I have searched Stack Overflow questions and Cross Validated.

Thanks for any answers, ideas, comments.

  • Those df may be right if `yr` is a within-`pl` factor. Note that for `lmer` models, `lsmeans` provides a choice of two d.f. methods: the Satterthwaite method, based on routines in the **lmerTest** package (the default), and the Kenward-Roger method, using routines in the **pbkrtest** package. Try adding the argument `lmer.df = "k"` to the `lsmeans` call and see what the results are for the latter method (sketched after these comments). If they are comparable, I'd trust that there isn't a problem here. Unrelated note -- I wonder why you opted for `REML=FALSE` in fitting the model. – Russ Lenth Apr 21 '17 at 14:04
  • By the way, the P values are not going to be all that much larger than those shown unless the d.f. are really small. – Russ Lenth Apr 21 '17 at 14:09
  • From the context of this question, I'm guessing that if you plot residuals versus run order (day??) separately for each plot, you will find yourself not so satisfied with the suitability of your model. I further imagine that there are substantial serial correlations in each plot. – Russ Lenth Apr 21 '17 at 16:49
  • Thanks @rvl. I opted for `REML=FALSE` because I selected my best model using maximum likelihood (a result of discussions with other R users, and of learning R at the time I started my analysis). Yes, I tried both d.f. methods, and the results are very similar. How is the huge difference in d.f. explained? It is in the 30-40s for the treatment-within-year contrasts, and around 6000 for the year-within-treatment contrasts. These are my sample sizes: Yr1,Cont n=810; Yr1,OTC n=1619; Yr2,Cont n=809; Yr2,OTC n=1458; Yr3,Cont n=809; Yr3,OTC n=1457. OTC/Cont is the treatment. – S.Dietz Apr 22 '17 at 12:04
  • Because the plots are the experimental units for treatments. It comes from the (1|pl) part in the model. – Russ Lenth Apr 22 '17 at 12:12
  • My apologies, @rvl, but I still don't understand this. OTCs have 10 plots, cont have 5 plots. – S.Dietz Apr 22 '17 at 12:35
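A sketch of the two checks suggested in the comments above, assuming modela as fitted earlier; the lmer.df argument is the one suggested in the first comment, and the run-order plot assumes rows are in time order within each plot:

lsmeans(modela, pairwise ~ yr * tr, adjust = "tukey", lmer.df = "k")  # Kenward-Roger d.f.

library(lattice)
mf  <- model.frame(modela)                          # rows actually used in the fit (NAs dropped)
res <- resid(modela)
ord <- ave(seq_along(res), mf$pl, FUN = seq_along)  # within-plot run order (base::ave, not the data column)
xyplot(res ~ ord | mf$pl, type = c("p", "smooth"),
       xlab = "Within-plot run order", ylab = "Residual")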

1 Answer


In this example, treatments are assigned experimentally to plots. Having small numbers of plots assigned to treatments severely limits the information available to statistically compare the treatments. (If you had only one plot per treatment, it would not even be possible to compare treatments, because you wouldn't be able to separate the effect of the treatments from the effect of the plots.) You have 10 plots assigned to one treatment and 5 to the other, so you have (10-1) + (5-1) = 13 d.f. for the main effect of treatment; and if you do

lsmeans(modela, pairwise ~ tr)

you will see around 13 d.f. (maybe fewer, due to imbalance and missingness) for those statistics. When you compare combinations of years and treatments, you get roughly 3 times the d.f. because there are 3 years. However, in some of those comparisons, the year is the same in each combination being compared; in those comparisons the plot-to-plot variation mostly cancels out (it is a within-plot comparison), and the d.f. basically come from the residual error for the model, which has thousands of d.f. Due to imbalances in the data, these comparisons are a little bit polluted by the between-plot variation, making the d.f. somewhat smaller than the residual d.f.
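As a quick check, a sketch reusing that call (assuming the usual lsmeans/contrasts structure of the result):

trt.lsm <- lsmeans(modela, pairwise ~ tr)
summary(trt.lsm$contrasts)  # d.f. for the treatment contrast should be near 13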

It appears you are not particularly interested in cross-comparisons such as treat1, year1 vs. treat2, year3. I suggest using "by" variables to cut down on the number of comparisons tested, because when you test them all, the multiplicity correction is unnecessarily conservative. It would go something like this:

modela.lsm = lsmeans(modela, ~ tr * yr)
pairs(modela.lsm, by = "yr")   # compare tr for each yr
pairs(modela.lsm, by = "tr")   # compare yr for each tr

These calls will apply the Tukey correction separately to each "by" group. If you want a multiplicity correction for each whole family, do this:

rbind(pairs(modela.lsm, by = "yr"))
rbind(pairs(modela.lsm, by = "tr"))

By default, a multivariate t correction is used (Tukey is not the right method here). You can even do

rbind(pairs(modela.lsm, by = "yr"), pairs(modela.lsm, by = "tr"))

to group all of the comparisons into one family and apply a multivariate t adjustment.

Russ Lenth
  • Thanks, this is helpful. I think using `lsmeans(modela, ~ tr | yr)` does this also. You are right, I am not interested in the year comparisons. – S.Dietz Apr 22 '17 at 16:14
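For reference, a sketch of that one-step form (note that pairwise is needed to get the contrasts as well as the means):

lsmeans(modela, pairwise ~ tr | yr)  # treatment comparisons within each year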