
My dataset has many redundant observations (but each observation should still be counted). So I am considering the 'weights' argument in GAM, because it significantly reduces computation time.

The documentation of the gam function (in the mgcv package) says that they are 'equivalent' (from ?gam, on the weights argument):

"Note that a weight of 2, for example, is equivalent to having made exactly the same observation twice."

But it does not seem right.

library(mgcv)

yy = c(5,2,8,9)
xx = 1:4
wgts = c(3,2,4,1)

## expand the data so that each observation is literally repeated
yy2 = rep(yy, wgts)
xx2 = rep(xx, wgts)

mod1 = gam(yy2 ~ xx2)                             ## repeated observations
mod2 = gam(yy ~ xx, weights = wgts)               ## weights as frequencies
mod3 = gam(yy ~ xx, weights = wgts / mean(wgts))  ## rescaled weights

predict(mod1, data.frame(xx2 = 1:4))
predict(mod2, data.frame(xx = 1:4))
predict(mod3, data.frame(xx = 1:4))

The estimates are identical in all three models. The standard errors are the same in models 2 and 3, but different in model 1. The GCV score differs across all three models.

I understand that the GCV scores can differ. But how can we say the models are equivalent if the standard errors differ? Is this a bug, or is there a good explanation for it?

user67275

1 Answer


The issue you saw is not about GAMs. You have used gam to fit a purely parametric model, in which case gam behaves almost the same as lm. To answer your questions, it is sufficient to focus on the linear regression case: what happens in a linear model also happens in GLMs and GAMs. Here is how we can reproduce the issue with lm:

yy <- c(5,2,8,9)
xx <- 1:4
wgts <- c(3,2,4,1)

## expand the data: every observation repeated according to its weight
yy2 <- rep(yy, wgts)
xx2 <- rep(xx, wgts)

fit1 <- lm(yy2 ~ xx2)                           ## ordinary regression with ties
fit2 <- lm(yy ~ xx, weights = wgts)             ## weighted regression
fit3 <- lm(yy ~ xx, weights = wgts/mean(wgts))  ## weighted regression, rescaled weights

summary1 <- summary(fit1)
summary2 <- summary(fit2)
summary3 <- summary(fit3)

pred1 <- predict(fit1, list(xx2 = xx), interval = "confidence", se.fit = TRUE)
pred2 <- predict(fit2, list(xx = xx), interval = "confidence", se.fit = TRUE)
pred3 <- predict(fit3, list(xx = xx), interval = "confidence", se.fit = TRUE)
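
As a quick sanity check of the claim that gam reduces to lm for this purely parametric formula (a minimal sketch, assuming mgcv is loaded as in the question's code; gam_fit is just a throwaway name here):

gam_fit <- gam(yy2 ~ xx2)                              ## same model refitted with gam
all.equal(unname(coef(gam_fit)), unname(coef(fit1)))   ## should be TRUE: same coefficients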

All models have the same regression coefficients, but other results may differ. You asked:

  1. For the weighted regressions fit2 and fit3, why is almost everything the same except the residual standard error?
  2. Why is the weighted regression (fit2 or fit3) not equivalent to the ordinary regression with ties (fit1)?

Your first question is about the invariance of weighted least squares to a rescaling of the weights. Here is a brief summary I made:

[image: a summary table of the weighted least squares quantities (coefficient estimates, residual standard error, covariance, and residual degrees of freedom n - p) and how each changes when the weights W are rescaled]
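
(The image itself is not reproduced here; for reference, the standard weighted least squares quantities it summarizes are, in the usual notation,

\[
\hat\beta = (X^\top W X)^{-1} X^\top W y, \qquad
\hat\sigma^2 = \frac{\sum_i w_i (y_i - x_i^\top \hat\beta)^2}{n - p}, \qquad
\widehat{\mathrm{Var}}(\hat\beta) = \hat\sigma^2 (X^\top W X)^{-1}.
\]

Replacing W by cW for any c > 0 leaves the first and third quantities unchanged: sigma2 is multiplied by c while (X'WX)^(-1) is divided by c, so only the residual standard error and the unscaled covariance move, by factors of sqrt(c) and 1/c respectively.)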

If we rescale W by an arbitrary positive constant, only the residual standard error and the unscaled covariance change. Such a change does not imply a different, non-equivalent model; in fact, nothing related to prediction is affected. In weighted regression, do not look at sigma2 on its own; it is only a marginal variance. What is really of interest is the gross covariance, sigma2 multiplied into (X'WX)^(-1). For example, if you divide your weights by 2, sigma2 is halved while (X'WX)^(-1) doubles, so their product, and hence every standard error derived from it, stays the same.

summary2$coef
summary3$coef

#            Estimate Std. Error   t value  Pr(>|t|)
#(Intercept) 2.128713   3.128697 0.6803832 0.5664609
#xx          1.683168   1.246503 1.3503125 0.3094222

pred2
pred3

#$fit
#       fit        lwr      upr
#1 3.811881 -5.0008685 12.62463
#2 5.495050 -0.1299942 11.12009
#3 7.178218  0.6095820 13.74685
#4 8.861386 -1.7302209 19.45299
#
#$se.fit
#       1        2        3        4 
#2.048213 1.307343 1.526648 2.461646 
#
#$df
#[1] 2
#
#$residual.scale  ## for `pred2`
#[1] 3.961448
#
#$residual.scale  ## for `pred3`
#[1] 2.50544
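
A quick way to see this invariance directly on the fits above (a minimal check, reusing fit2 and fit3):

all.equal(vcov(fit2), vcov(fit3))   ## TRUE: same coefficient covariance
sigma(fit2) / sigma(fit3)           ## about 1.58, i.e. sqrt(mean(wgts)) = sqrt(2.5)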

Your second question is about the meaning of weights. Weights are used to model a heteroscedastic response, so that high-variance observations do not unduly dominate an ordinary least squares fit. Weights are proportional to the reciprocal variance: you give bigger weights to data with smaller expected errors. Weights need not be integers, so they do not have a natural interpretation in terms of repeated data. Thus, what is written in the mgcv documentation is not rigorously correct.
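
For example, here is a hypothetical sketch of that intended use (the names x, v, y, fit_w are made up for illustration): the response variance grows with x, and the weights are the reciprocal of that variance.

set.seed(1)
x <- 1:50
v <- x                                    ## hypothetical: Var(y_i) grows with x_i
y <- 2 + 3 * x + rnorm(50, sd = sqrt(v))  ## heteroscedastic response
fit_w <- lm(y ~ x, weights = 1 / v)       ## weights proportional to 1 / variance (non-integer)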

The real difference between fit1 and fit2 is the degrees of freedom. Check the above table for (n - p): n is the number of observations, while p is the number of non-NA coefficients, so n - p is the residual degrees of freedom. Both models have p = 2 (intercept and slope), but fit1 has n = 10 while fit2 has n = 4. This has a dramatic effect on inference, as the standard errors for coefficients and predictions (hence confidence intervals) now differ. These two models are far from equivalent.
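
You can read the residual degrees of freedom off the two fits directly:

df.residual(fit1)   ## 10 - 2 = 8
df.residual(fit2)   ##  4 - 2 = 2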

summary1$coef
#            Estimate Std. Error  t value   Pr(>|t|)
#(Intercept) 2.128713  1.5643486 1.360766 0.21068210
#xx2         1.683168  0.6232514 2.700625 0.02704784

summary2$coef

#            Estimate Std. Error   t value  Pr(>|t|)
#(Intercept) 2.128713   3.128697 0.6803832 0.5664609
#xx          1.683168   1.246503 1.3503125 0.3094222

pred1

#$fit
#       fit      lwr       upr
#1 3.811881 1.450287  6.173475
#2 5.495050 3.987680  7.002419
#3 7.178218 5.417990  8.938446
#4 8.861386 6.023103 11.699669
#
#$se.fit
#        1         2         3         4 
#1.0241066 0.6536716 0.7633240 1.2308229 
#
#$df    # note, this is `10 - 2 = 8`
#[1] 8
#
#$residual.scale
#[1] 1.980724

pred2

#$fit
#       fit        lwr      upr
#1 3.811881 -5.0008685 12.62463
#2 5.495050 -0.1299942 11.12009
#3 7.178218  0.6095820 13.74685
#4 8.861386 -1.7302209 19.45299
#
#$se.fit
#       1        2        3        4 
#2.048213 1.307343 1.526648 2.461646 
#
#$df    # note, this is `4 - 2 = 2`
#[1] 2
#
#$residual.scale  ## for `pred2`
#[1] 3.961448
Zheyuan Li
  • What a full-service answer! Very thorough. – Gregor Thomas Oct 18 '16 at 22:43
  • First of all, GAM and linear regression work totally differently, so explaining with 'lm' may be inappropriate. Linear regression (or weighted linear regression) finds coefficients by matrix calculation, (X'X)^(-1)X'y (or (X'WX)^(-1)X'Wy). On the other hand, GAM inference doesn't involve such matrix calculation; it uses the maximum likelihood method. The GAM manual also says that the 'weights' option is "prior weights on the contribution of the data to the log likelihood." – user67275 Oct 21 '16 at 01:43
  • In addition, I'm not sure the 'weights' argument in GAM is intended for dealing with heteroscedasticity. As the 'gam' manual says, the weights option can be used in the situation of "having made exactly the same observation twice", and I wondered why the results (tied observations vs. using weights as frequencies) are not identical. – user67275 Oct 21 '16 at 01:46
  • I think I intended to use 's(xx)' instead of 'xx' as a regressor, but forgot to. Then I think the description "Note that a weight of 2, for example, is equivalent to having made exactly the same observation twice." in the GAM manual is not exact, right? – user67275 Oct 21 '16 at 01:52