
For my experiment, I clipped plants and measured their responses, such as leaf mass produced, at the end of the season. I manipulated both clipping intensity and clipping timing and crossed these two treatments; I also included an unclipped control, giving 5 treatment combinations in all. With 12 plants per treatment there are 60 plants in total, which I followed over two years: I collected measurements on these 60 plants in year 1 and on the same plants again in year 2.

Here is my design, where "never" (timing) and "zero" (intensity) are arbitrary labels standing in for the control treatment:

 Year   Timing  Intensity  Treatment
 2015   early   high       early-high
 2015   early   low        early-low
 2015   late    high       late-high
 2015   late    low        late-low
 2015   never   zero       control
 2014   early   high       early-high
 2014   early   low        early-low
 2014   late    high       late-high
 2014   late    low        late-low
 2014   never   zero       control
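For reference, this layout can be reproduced in R; a minimal sketch (the object names here are mine, not from the analysis code below):

```r
# Sketch of the 5 treatment combinations: a 2x2 cross of timing and
# intensity, plus a never/zero control outside the factorial cross.
crossed <- expand.grid(timing = c("early", "late"),
                       intensity = c("high", "low"),
                       stringsAsFactors = FALSE)
design <- rbind(crossed, data.frame(timing = "never", intensity = "zero"))
design$treatment <- ifelse(design$timing == "never", "control",
                           paste(design$timing, design$intensity, sep = "-"))
design  # 5 rows; each is applied to 12 plants in each of 2 years
```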

I followed a suggestion by Ben Bolker (in "R - analyzing repeated measures unbalanced design with lme4?") to ignore the warnings from lme4 and then run F-tests on the model:

library(lme4)
library(car)

m1 <- lmer(log(plant.leaf.g) ~ timing*intensity*year + (1|id), data = cmv)
Anova(m1, type = "III", test = "F")

The Anova output gave me a significant interaction between timing and intensity (p = 0.006), and I followed up with a multiple comparison test using:

library(multcomp)

cmv$SHD <- interaction(cmv$timing, cmv$intensity)
m2 <- lmer(log(plant.leaf.g) ~ -1 + SHD + (1|id), data = cmv, na.action = na.exclude)
summary(glht(m2, linfct = mcp(SHD = "Tukey")))

Here is a clip of my output, including the pair that comes closest to significance (p = 0.089):

                                Estimate Std. Error z value Pr(>|z|)  
late.2014 - early.2014 == 0   -0.6584     0.3448  -1.910   0.3844  
never.2014 - early.2014 == 0   0.1450     0.4102   0.354   0.9992  
early.2015 - early.2014 == 0  -0.4906     0.2786  -1.761   0.4788  
late.2015 - early.2014 == 0   -0.1687     0.3494  -0.483   0.9965  
never.2015 - early.2014 == 0   0.4201     0.4079   1.030   0.9032  
never.2014 - late.2014 == 0    0.8034     0.4119   1.951   0.3597  
early.2015 - late.2014 == 0    0.1678     0.3419   0.491   0.9963  
late.2015 - late.2014 == 0     0.4897     0.2724   1.797   0.4553  
never.2015 - late.2014 == 0    1.0785     0.4119   2.618   0.0885 .
early.2015 - never.2014 == 0  -0.6356     0.4074  -1.560   0.6133 

Why did Anova deem timing*intensity highly significant, yet nothing comes out significant in the multiple comparison test? Is there another way I should be doing the multiple comparisons?

In other multiple comparison outputs I get p-values as high as 1.00000, is this normal?

data<-structure(list(id = c(91L, 92L, 93L, 94L, 95L, 96L, 97L, 98L, 
99L, 100L, 101L, 102L, 103L, 105L, 106L, 107L, 108L, 109L, 110L, 
111L, 112L, 113L, 114L, 115L, 116L, 117L, 119L, 120L, 121L, 122L, 
123L, 124L, 125L, 126L, 127L, 128L, 129L, 130L, 131L, 132L, 133L, 
134L, 135L, 136L, 137L, 138L, 139L, 140L, 141L, 142L, 143L, 144L, 
146L, 147L, 148L, 149L, 150L, 91L, 92L, 93L, 94L, 95L, 96L, 97L, 
98L, 99L, 100L, 101L, 102L, 103L, 105L, 106L, 107L, 108L, 109L, 
110L, 111L, 112L, 113L, 114L, 115L, 116L, 117L, 119L, 120L, 121L, 
122L, 123L, 124L, 125L, 126L, 127L, 128L, 129L, 130L, 131L, 132L, 
133L, 134L, 135L, 136L, 137L, 138L, 139L, 140L, 141L, 142L, 143L, 
144L, 146L, 147L, 148L, 149L, 150L), quad = c(2L, 2L, 2L, 2L, 
2L, 2L, 2L, 2L, 2L, 2L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 
6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 
3L, 3L, 3L, 3L, 3L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 5L, 
5L, 5L, 5L, 5L, 5L, 5L, 5L, 5L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 6L, 
6L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 4L, 4L, 4L, 4L, 4L, 
4L, 4L, 4L, 4L, 4L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), year = c(2015L, 
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 
2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 2015L, 
2015L, 2015L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 
2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 
2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 
2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 
2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 
2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 2014L, 
2014L, 2014L, 2014L, 2014L, 2014L), timing = structure(c(1L, 
3L, 2L, 1L, 1L, 2L, 3L, 2L, 2L, 2L, 2L, 1L, 1L, 2L, 3L, 1L, 1L, 
3L, 2L, 3L, 1L, 3L, 2L, 3L, 1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 
1L, 3L, 3L, 2L, 2L, 1L, 2L, 3L, 2L, 1L, 1L, 2L, 2L, 3L, 1L, 2L, 
2L, 2L, 1L, 1L, 2L, 1L, 1L, 3L, 1L, 3L, 2L, 1L, 1L, 2L, 3L, 2L, 
2L, 2L, 2L, 1L, 1L, 2L, 3L, 1L, 1L, 3L, 2L, 3L, 1L, 3L, 2L, 3L, 
1L, 1L, 1L, 2L, 1L, 2L, 2L, 1L, 1L, 1L, 3L, 3L, 2L, 2L, 1L, 2L, 
3L, 2L, 1L, 1L, 2L, 2L, 3L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 1L, 3L, 
1L), .Label = c("early", "late", "never"), class = "factor"), 
intensity = structure(c(2L, 3L, 1L, 2L, 1L, 2L, 3L, 1L, 2L, 
1L, 2L, 1L, 1L, 2L, 3L, 2L, 1L, 3L, 1L, 3L, 1L, 3L, 1L, 3L, 
1L, 2L, 1L, 1L, 2L, 2L, 2L, 2L, 1L, 2L, 3L, 3L, 2L, 1L, 1L, 
1L, 3L, 1L, 1L, 2L, 2L, 1L, 3L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 
2L, 1L, 3L, 2L, 3L, 1L, 2L, 1L, 2L, 3L, 1L, 2L, 1L, 2L, 1L, 
1L, 2L, 3L, 2L, 1L, 3L, 1L, 3L, 1L, 3L, 1L, 3L, 1L, 2L, 1L, 
1L, 2L, 2L, 2L, 2L, 1L, 2L, 3L, 3L, 2L, 1L, 1L, 1L, 3L, 1L, 
1L, 2L, 2L, 1L, 3L, 2L, 1L, 2L, 1L, 1L, 2L, 2L, 1L, 3L, 2L
), .Label = c("high", "low", "zero"), class = "factor"), 
treatment = structure(c(3L, 1L, 4L, 3L, 2L, 5L, 1L, 4L, 5L, 
4L, 5L, 2L, 2L, 5L, 1L, 3L, 2L, 1L, 4L, 1L, 2L, 1L, 4L, 1L, 
2L, 3L, 2L, 4L, 3L, 5L, 5L, 3L, 2L, 3L, 1L, 1L, 5L, 4L, 2L, 
4L, 1L, 4L, 2L, 3L, 5L, 4L, 1L, 3L, 4L, 5L, 4L, 2L, 3L, 5L, 
3L, 2L, 1L, 3L, 1L, 4L, 3L, 2L, 5L, 1L, 4L, 5L, 4L, 5L, 2L, 
2L, 5L, 1L, 3L, 2L, 1L, 4L, 1L, 2L, 1L, 4L, 1L, 2L, 3L, 2L, 
4L, 3L, 5L, 5L, 3L, 2L, 3L, 1L, 1L, 5L, 4L, 2L, 4L, 1L, 4L, 
2L, 3L, 5L, 4L, 1L, 3L, 4L, 5L, 4L, 4L, 3L, 5L, 2L, 1L, 3L
), .Label = c("control", "early-high", "early-low", "late-high", 
"late-low"), class = "factor"), plant.leaf.g = c(846.216, 
382.704, 2393.088, 61.832, 1315.86, 275.816, 3705.862, 3500.52, 
67.482, 432, 487.492, 1228.618, 776.16, 1575, 735.9, 2417.75, 
1342.92, 2359.046, 686.726, 1385.856, 343.684, 2277.312, 
465.528, 2314.584, 508.4, 1243.644, 1064.448, 1020.646, NA, 
494.832, 1318.248, 1516.4, 1271.218, 512.512, 157.878, 3753.992, 
586.032, 1042.176, 889.632, 651.052, 498.042, 625.872, 16.28, 
497.51, 593.75, 706.84, 2238.742, 232.584, 671.532, 90.72, 
1412.442, 902.728, 3077.184, 619.106, 0.576, 400.452, 684.522, 
849.852, 152.76, 1280.448, 274.47, 387.614, 98.496, 2304.504, 
644.952, 35.392, 250.56, 267.33, 2212.08, 2392.596, 751.944, 
629.418, 731.544, 1013.196, 1516.4, 130.536, 2910.6, 554.4, 
2163.35, 223.86, 2369.376, 551.976, 985.6, 1482.24, 815.386, 
1664.132, 596.376, 1581.432, 217.128, 1041.656, 951.168, 
256.172, 1587.148, 359.448, 546.48, 1226.544, 371.64, 293.504, 
177.726, 343.26, 691.24, 207.604, 588.924, 1405.258, 136.17, 
451.432, 576.18, 424.804, 884.534, 2466.45, 1524.432, 973.208, 
369.474, 410.048)), .Names = c("id", "quad", "year", "timing", 
"intensity", "treatment", "plant.leaf.g"), class = "data.frame", row.names = c(NA, 
-114L))

PS. I could not for the life of me get lsmeans to work with this unbalanced design; it reports a lot of NAs in the output.

Angela P
  • If `lsmeans` is reporting `NA`s, that means the results you are asking for are not uniquely estimable; and that in turn is usually due to having no data at some factor combinations. You may have better luck with it if you leave out some interactions in the model. For starters, you might try fitting the model `log(plant.leaf.g)~(timing+intensity+year)^2+(1|id)` which omits the three-way interaction. – Russ Lenth Aug 28 '16 at 02:51
  • Looking more closely at your data listing, it appears there is only one `intensity` value for each combination of `year` and `timing`. This means you have linear dependence among these three factors. You should leave one of them completely out of the model. I suggest omitting `year` since it isn't a primary factor in the comparisons you are doing. – Russ Lenth Aug 28 '16 at 02:58
  • Using the code `log(plant.leaf.g)~(timing+intensity+year)^2+(1|id)`, I still get the warning "fixed-effect model matrix is rank deficient so dropping 5 columns / coefficients" and `lsmeans` only produces `NA`s. Also, I will edit my data to better represent my design, because it is a subset that I shared so doesn't represent the balance of my treatments. Timing and intensity are fully crossed while the control isn't. @rvl – Angela P Aug 28 '16 at 17:58
  • Please read my second comment. – Russ Lenth Aug 28 '16 at 18:07
  • Unfortunately, removing year didn't solve the problem, I've updated my design to show that there are 2 combinations of intensity per combination of year and timing. Thanks for pointing that out! – Angela P Aug 28 '16 at 18:59
  • I'll just repeat that the real problem isn't the software, it's the fact that your model can't be used to estimate the comparisons you want. – Russ Lenth Aug 28 '16 at 20:23

2 Answers


Because I did not read your question carefully, I missed that you hadn't tried lsmeans on a model using treatment in place of timing*intensity (for some reason, the dataset you provided has a different name, and treatment instead of SHD). If you do, it works just fine:

> m3<-lmer(log(plant.leaf.g) ~ treatment+year+(1|id), data=data)
> library(lsmeans)
> lsmeans(m3, "treatment", type = "response")
Loading required namespace: pbkrtest
 treatment   response       SE    df lower.CL  upper.CL
 control    1017.7290 289.1544 62.29 576.7671 1795.8244
 early-high  909.2335 260.3904 68.44 513.4725 1610.0288
 early-low   388.1875 116.3790 65.92 213.3433  706.3242
 late-high   626.5379 176.6823 56.61 356.1791 1102.1134
 late-low    393.3225 125.4142 51.60 207.4053  745.8947

Confidence level used: 0.95 
Intervals are back-transformed from the log scale 

> pairs(.Last.value, type = "response")
 contrast               response.ratio        SE    df t.ratio p.value
 control - early-high        1.1193264 0.4457451 72.14   0.283  0.9986
 control - early-low         2.6217461 1.0685111 71.26   2.365  0.1371
 control - late-high         1.6243693 0.6501965 59.75   1.212  0.7445
 control - late-low          2.5875182 1.1050617 56.07   2.226  0.1853
 early-high - early-low      2.3422535 0.9581385 74.29   2.081  0.2394
 early-high - late-high      1.4512026 0.5762732 68.49   0.938  0.8811
 early-high - late-low       2.3116745 0.9907174 58.53   1.955  0.3006
 early-low - late-high       0.6195754 0.2549274 61.77  -1.163  0.7719
 early-low - late-low        0.9869446 0.4319594 57.88  -0.030  1.0000
 late-high - late-low        1.5929371 0.6780835 53.73   1.094  0.8090

P value adjustment: tukey method for comparing a family of 5 estimates 
Tests are performed on the log scale 

Now, a bit on your original question. We see that the smallest P value above is about .13, whereas

> library(car)
> Anova(m3)
Analysis of Deviance Table (Type II Wald chisquare tests)

Response: log(plant.leaf.g)
           Chisq Df Pr(>Chisq)
treatment 9.5752  4    0.04823
year      0.1147  1    0.73484

so the ANOVA test of treatment has a P value of about .05. A barely significant ANOVA $F$ test is equivalent to saying that some contrast among the levels of treatment is barely significant when judged against the Scheffé critical value. If that contrast happens to be a pairwise comparison, or nearly one, then that pairwise comparison will be significant; but that isn't the case here. The first and second means are much higher than the third and fifth, and that leads me to this contrast, which is quite significant:

> contrast(lsmeans(m3, "treatment"), list(my.con = c(1, 1, -1, 0, -1)))
 contrast estimate        SE    df t.ratio p.value
 my.con   1.801813 0.5910685 64.56   3.048  0.0033

Results are given on the log (not the response) scale. 
Tests are performed on the log scale

Another contrast that comes out somewhat significant is a comparison of high and low intensity:

> contrast(lsmeans(m3, "treatment"), list(hi.lo = c(0, 1, -1, 1, -1)/2))
 contrast  estimate        SE    df t.ratio p.value
 hi.lo    0.6583465 0.2967584 60.45   2.218  0.0303

Remember, ANOVA tests don't guarantee much of anything about how the pairwise comparisons come out.
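To make the Scheffé connection concrete, here is a small numeric sketch (the error df of 60 is assumed for illustration, not taken from this fit): a contrast is Scheffé-significant only if its |t| exceeds sqrt((k-1) * F_crit), which is well above the unadjusted per-test cutoff, so an omnibus test can reach p ≈ .05 while no multiplicity-adjusted pairwise comparison does.

```r
# Scheffe critical value for contrasts among k = 5 means,
# with an assumed error df of 60 (illustrative only).
k  <- 5
df <- 60
scheffe.crit <- sqrt((k - 1) * qf(0.95, k - 1, df))
scheffe.crit        # about 3.2: the |t| any contrast must exceed
qt(0.975, df)       # about 2.0: the unadjusted per-test cutoff
```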

Russ Lenth
  • Thank you, that makes sense! Just curious about what the 0, 1 and -1 comes from in your code `list(hi.lo = c(0, 1, -1, 1, -1)/2)` and why you divide by 2? – Angela P Aug 29 '16 at 15:39

Another shot at this. You (the OP) know this already, but to make it clear to others, here's a look at how these factors relate:

R> with(data, table(timing, intensity, year))
, , year = 2014

       intensity
timing  high low zero
  early   11  11    0
  late    13  10    0
  never    0   0   12

, , year = 2015

       intensity
timing  high low zero
  early   12  11    0
  late    12  10    0
  never    0   0   12

Notice that timing = "never" with intensity = "zero" is a special control condition, and only the other levels of these factors appear in combination. That's why models treating them as separate factors lead to difficulties in interpretation. The factor treatment is included in the dataset, having as levels the 5 combinations that actually occur.

Consider some models (based on residual plots, I think the square-root transform is best):

R> library("lme4")
R> m3 = lmer(sqrt(plant.leaf.g) ~ treatment + year + (1|id), data=data)
R> m4 = lmer(sqrt(plant.leaf.g) ~ treatment * year + (1|id), data=data)
Warning message:
Some predictor variables are on very different scales: consider rescaling 
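The residual check behind that choice could be sketched roughly as follows (a diagnostic sketch under the same model form; the exact plots used aren't shown here, and `data` is assumed to be the data frame from the question):

```r
# Sketch: eyeball residual-vs-fitted plots for log vs sqrt responses.
# Assumes lme4 is loaded and `data` is the data frame from the question.
m.log  <- lmer(log(plant.leaf.g)  ~ treatment + year + (1|id), data = data)
m.sqrt <- lmer(sqrt(plant.leaf.g) ~ treatment + year + (1|id), data = data)
par(mfrow = c(1, 2))
plot(fitted(m.log),  resid(m.log),  main = "log",  xlab = "fitted", ylab = "residual")
plot(fitted(m.sqrt), resid(m.sqrt), main = "sqrt", xlab = "fitted", ylab = "residual")
# qqnorm(resid(m.sqrt)) is also worth checking for normality
```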

Compare these two models:

R> anova(m3, m4)
refitting model(s) with ML (instead of REML)
Data: data
Models:
m3: sqrt(plant.leaf.g) ~ treatment + year + (1 | id)
m4: sqrt(plant.leaf.g) ~ treatment * year + (1 | id)
   Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)
m3  8 876.92 898.74 -430.46   860.92                         
m4 12 872.14 904.87 -424.07   848.14 12.783      4    0.01238

It appears to be worthwhile to include the interaction with year despite the warning message.

To interpret this model, the ANOVA table is of limited use. It's more informative to look at what the model predicts and do appropriate comparisons. These predictions are called "least-squares means". We'll compute them separately for each year (because of the interaction with year):

R> library("lsmeans")
R> (m4.lsm = lsmeans(m4, ~ treatment | year, at = list(year = c(2014, 2015)), type = "response"))
year = 2014:
 treatment   response       SE    df lower.CL  upper.CL
 control    1082.1190 224.6040 91.46 681.9805 1574.2165
 early-high 1149.8090 236.6617 97.92 728.1153 1667.4202
 early-low   647.5407 180.8791 92.57 338.1453 1056.5696
 late-high   490.0813 148.4696 82.72 239.2544  829.8841
 late-low    485.0953 171.6529 73.61 203.3380  887.4499

year = 2015:
 treatment   response       SE    df lower.CL  upper.CL
 control    1393.5241 254.9350 91.35 933.1535 1945.8958
 early-high  831.3529 192.9245 97.47 492.5582 1258.3147
 early-low   746.0977 199.8967 96.30 402.0731 1195.6255
 late-high  1050.4552 225.8614 83.42 649.2805 1547.6725
 late-low    520.3968 177.7890 73.61 226.4119  934.9789

Confidence level used: 0.95 
Intervals are back-transformed from the sqrt scale

Finally, we can do comparisons among them. It's possible to do all pairwise comparisons, but it's more informative to construct contrasts that form an interpretable breakdown of the 4 d.f. for treatment:

R> trt.con = data.frame(
+     timing = c(0, 1, 1, -1, -1)/2,
+     intensity = c(0, 1, -1, 1, -1)/2,
+     tim.int = c(0, 1, -1, -1, 1)/2,
+     ctl.vs.trt = c(4, -1, -1, -1, -1)/4,
+     row.names = levels(data$treatment))
R> trt.con
           timing intensity tim.int ctl.vs.trt
control       0.0       0.0     0.0       1.00
early-high    0.5       0.5     0.5      -0.25
early-low     0.5      -0.5    -0.5      -0.25
late-high    -0.5       0.5    -0.5      -0.25
late-low     -0.5      -0.5     0.5      -0.25

These are coefficients to be applied to the treatment least-squares means. For example, the timing contrast compares the average of the two early treatments with the average of the two late treatments (dividing by 2 turns each sum into an average). The third column is their interaction, and the fourth compares the control condition with the average of the other four. Now we can test these contrasts:

R> contrast(m4.lsm, trt.con)
year = 2014:
 contrast     estimate       SE    df t.ratio p.value
 timing      7.5964984 3.586509 86.29   2.118  0.0370
 intensity   4.2874559 3.569964 87.92   1.201  0.2330
 tim.int     4.1745568 3.508331 94.05   1.190  0.2371
 ctl.vs.trt  7.0159980 3.800634 95.97   1.846  0.0680

year = 2015:
 contrast     estimate       SE    df t.ratio p.value
 timing      0.4625233 3.606278 87.74   0.128  0.8982
 intensity   5.5584603 3.596573 88.42   1.545  0.1258
 tim.int    -4.0400583 3.529456 94.91  -1.145  0.2552
 ctl.vs.trt  9.4872065 3.810418 95.75   2.490  0.0145

An interesting result is that there is some evidence in both years that the control condition differs from the clipping treatments, and that the timing contrast is significant only in 2014.
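On the mechanics of these contrasts (this is also the point of the 0, 1, and -1 coefficients asked about in the comments): each estimate is just the coefficient-weighted sum of the least-squares means, and dividing by 2 makes each half an average rather than a sum. A sketch with made-up means, not the fitted values above:

```r
# Hypothetical least-squares means, one per treatment level:
lsm <- c(control = 10, early.high = 12, early.low = 8,
         late.high = 9, late.low = 7)
timing <- c(0, 1, 1, -1, -1) / 2   # (early average) - (late average)
sum(timing * lsm)                  # (12 + 8)/2 - (9 + 7)/2 = 2
# Without the /2, the same pattern would estimate a sum of means minus
# a sum of means (4 here), not a difference of two group averages.
```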

(I realize this has become more of a Cross Validated-style answer than a Stack Overflow one; but ultimately, doing something meaningful trumps merely getting programs to run.)

Russ Lenth
  • This is a really interesting way to analyze the data, probably the better way in terms of the design I have. However, it's important for me to be able to compare the different clipping times to the control, and similarly for the different clipping intensities. PS I really appreciate your detailed breakdown of everything- really cool! – Angela P Aug 29 '16 at 21:58
  • I understand. You might want to replace that last contrast with a few others, e.g., c(1, -.5, -.5, 0, 0), c(1, 0, 0, -.5, -.5), etc. – Russ Lenth Aug 29 '16 at 22:57