effect vs. predict function

Question

I'm simultaneously trying to understand R's predict() function and the 'effects' package effect() function. Essentially, I'm running a regression to test the interaction of two dichotomous IVs on a DV while controlling for two continuous covariates. In my actual dataset, the interaction is significant and so now I would like to plot the interaction. Because I have covariates in my model, I should plot the means after controlling for these other variables (i.e. estimated marginal means in SPSS). I haven't done this in R before and while searching I've come to expect I should be able to obtain the values I need for graphing with either the effect() or the predict() functions. Therefore, I tried doing it with each on a randomly generated dataset:

> set.seed(100)
> test <- data.frame(iv1 = factor(round(rnorm(200, mean=.5, sd=.25), 0), levels=c(0,1), labels=c("A","B")), iv2 = factor(round(rnorm(200, mean=.5, sd=.25), 0), levels=c(0,1), labels=c("C","D")), cv1 = rnorm(200, mean=4, sd=1), cv2 = rnorm(200, mean=3, sd=1), dv = rnorm(200, mean=5, sd=1))
> mod <- lm(dv ~ cv1 + cv2 + iv1*iv2, data = test)
> new <- with(test, expand.grid(iv1 = levels(iv1), iv2 = levels(iv2), cv1 = mean(cv1), cv2 = mean(cv2)))
> test$pv <- predict(mod, newdata = new)

> tapply(test$pv, list(test$iv1, test$iv2), mean)
         C        D
A 5.076842 5.086218
B 5.025614 5.065399

> effect("iv1:iv2", mod)

 iv1*iv2 effect
   iv2
iv1        C        D
  A 5.019391 5.167275
  B 5.216955 4.855195

Because I'm getting different results I exported the data to SPSS and ran an ANOVA doing the same thing and looked at the estimated marginal means (EMMEANS). These were identical to the results given by effect() in R.

SPSS syntax:

DATASET ACTIVATE DataSet1.
RECODE iv1 iv2 ('A'=-1) ('B'=1) ('C'=-1) ('D'=1) INTO iv1_recode iv2_recode.
EXECUTE.

UNIANOVA dv BY iv1_recode iv2_recode WITH cv1 cv2
  /METHOD=SSTYPE(3)
  /INTERCEPT=INCLUDE
  /EMMEANS=TABLES(OVERALL) WITH(cv1=MEAN cv2=MEAN) 
  /EMMEANS=TABLES(iv1_recode) WITH(cv1=MEAN cv2=MEAN) 
  /EMMEANS=TABLES(iv2_recode) WITH(cv1=MEAN cv2=MEAN) 
  /EMMEANS=TABLES(iv1_recode*iv2_recode) WITH(cv1=MEAN cv2=MEAN) 
  /PRINT=DESCRIPTIVE
  /CRITERIA=ALPHA(.05)
  /DESIGN=cv1 cv2 iv1_recode iv2_recode iv1_recode*iv2_recode.

As a check, the SPSS output for the EMMEANS says, "Covariates appearing in the model are evaluated at the following values: cv1 = 3.996208827095569, cv2 = 3.052881951477868." These are identical to the values for the covariates that I used with predict:

> new
  iv1 iv2      cv1      cv2
1   A   C 3.996209 3.052882
2   B   C 3.996209 3.052882
3   A   D 3.996209 3.052882
4   B   D 3.996209 3.052882

So what am I failing to understand? Or am I doing something stupid here (a distinct possibility)? This could be me not grasping what an estimated marginal mean is.

Any help is greatly appreciated!

+1 for the nice, reproducible example – MrFlick Jul 22 '14 at 22:25 — MrFlick, Jul 22 '14 at 22:25

score 3 · Accepted Answer · answered Jul 22 '14 at 22:25

So there seems to be a bit of confusion here about getting results for the model itself, and the model applied to the observed data. A big problem occurs here

test$pv <- predict(mod, newdata = new)

Here, new had 4 rows so predict(mod, newdata = new) has for values. Running just that gives

predict(mod, newdata = new)
#        1        2        3        4 
# 5.019391 5.216955 5.167275 4.855195

and notice how these values match the result from effect().

When you assign them to test$pv, that length 4 vector gets recycled so it ends up repeating 50 times along the test data.frame. And test really contains your observed data, so mixing theoretical predictions form the model and the observed data isn't really a super idea. If you actually wanted the "true" predicted value for each observation, then test$pv<-predict(mod) would have been the right choice. However, taking the sums over test, which is again the observed values`, with

tapply(test$pv, list(test$iv1, test$iv2), mean)

would be using the values of cv1 and cv2 that were actually observed, rather than just the overall mean of your covariates.

We've already seen that effect() use the mean of the covariates, but you could also explicitly set values with

effect("iv1:iv2", mod, given.values=c(cv1=3.996209, cv2=3.052882))

if you like.

Ahh, thanks for the help and clear explanation. It seems like a silly oversight but this helps clarify how the predict() function is operating for me and where I was getting confused. — YTD, Jul 22 '14 at 23:36

effect vs. predict function

1 Answers1

Linked