
I asked a statistician how I can identify and remove points that have too much leverage on GAM (Generalized Additive Model) fits.

He told me that I can do this based on the influence/projection/hat matrix. I have also seen that @Gavin Simpson had the same idea.

Unfortunately, I do not know how to do this in practice in R. I can extract the influence matrix of a GAM via the influence.gam function, but then I don't know how to make the connection between the influence matrix and the raw data, i.e., which rows of the raw data should be removed.

Does anyone know how to remove data points with too much leverage based on the influence matrix? Is there a function that works for GAMs that have been fitted by gamm4?

Example code:

library(mgcv)

set.seed(11)
x1 <- c(100, rnorm(100, 5, 1))    # one extreme x1 value
x2 <- c(runif(100, 0, 100), 300)  # one extreme x2 value
y  <- x1 * x2 * rnorm(101, 50, 5)
d1 <- data.frame(y, x1, x2)

mod <- gam(y ~ x1 * x2, data = d1)

inf <- influence.gam(mod)  # diagonal of the hat matrix, one value per observation
hist(inf)

EDIT: Thanks for your answer, 李哲源 Zheyuan Li. I realized I totally forgot to include s() smooths. I still don't really understand what influence.gam actually returns. Are those Cook's distances, leverages, or neither of the two? Is it proper to delete all values above 0.5? And is it proper to do the same procedure with a gamm object (influence.gam(gamm_model$gam))?

mod1 <- gam(y ~ s(x1) + s(x2), data = d1)

inf1 <- influence.gam(mod1)
hist(inf1)
any(inf1 < 0)  # at least in this example, all values lie between 0 and 1

mod2 <- gam(y ~ s(x1, k = 8, fx = TRUE) + s(x2, k = 3, fx = TRUE), data = d1)
summary(mod2)

inf2 <- influence.gam(mod2)
hist(inf2)
d1$inf2 <- inf2

d2 <- subset(d1, inf2 < 0.5)  # drop rows with influence >= 0.5

mod3 <- gam(y ~ s(x1) + s(x2), data = d2)
summary(mod3)
plot(mod3)
bee guy
  • Thanks for your answer, and sorry that I kept quiet for so long. I still don't understand how to make the connection between the influence matrix and the raw data, but if I understand you correctly, I don't need to, because I shouldn't remove points with too much leverage when fitting a gamm anyway, right? – bee guy May 17 '17 at 12:36
  • Thanks for the link. – bee guy May 17 '17 at 12:50

1 Answer


Have you read this thread on Cross Validated: https://stats.stackexchange.com/q/65912/117783? Apart from the two answers, read gung's comment under the question, too. Leverage, Cook's distance and outliers are well explained there for linear regression, and the same holds for GLMs. Once you have a rough understanding of this standard theory, let's look at GAMs.
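For reference, in the linear-model case both quantities are directly available; a minimal sketch on R's built-in cars data (purely illustrative):

fit <- lm(dist ~ speed, data = cars)
lev <- hatvalues(fit)       # leverages: diagonal of the hat matrix
cd  <- cooks.distance(fit)  # Cook's distances
which(lev > 2 * mean(lev))  # common 2p/n rule of thumb for high leverage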

The major difficulty is that the hat matrix of a GAM has different properties, so (at least) I would not trust leverage and Cook's distance to be meaningful for outlier detection in a GAM.

Why? Unlike in a linear model or GLM, the leverages of a GAM, i.e., of a penalized GLM, are not necessarily within [0, 1]; they can be negative. Moreover, the rows / columns of the hat matrix do not each sum to 1, even when there is an intercept in the model. Interpreting such leverages is difficult. Conventionally we use Cook's distance to pick out outliers, but with a negative leverage you get a negative Cook's distance. What would that imply?
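You can inspect these quantities empirically; a sketch using mgcv's built-in simulator (the identity between the trace of the hat matrix and the total edf is a standard property of penalized smoothers):

library(mgcv)
set.seed(1)
dat <- gamSim(1, n = 200, verbose = FALSE)
b   <- gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)
lev <- influence(b)  # influence.gam: diagonal of the hat matrix
range(lev)           # no guarantee these stay in [0, 1]
sum(lev)             # trace of the hat matrix ...
sum(b$edf)           # ... equals the total effective degrees of freedom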

I would say that there is no real "outlier" in nonparametric regression. As the smoothing parameters go to 0, the fit can get closer and closer to the observations. In other words, the conclusion about what counts as an outlier depends on the smoothing parameters!
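You can see this by refitting the same model with the penalty switched off; a sketch on simulated data (the drop in the residual sum of squares shows the unpenalized fit chasing the observations):

library(mgcv)
set.seed(2)
x <- runif(100)
y <- sin(2 * pi * x) + rnorm(100, sd = 0.3)
b_pen   <- gam(y ~ s(x, k = 30))          # smoothing parameter estimated
b_nopen <- gam(y ~ s(x, k = 30), sp = 0)  # smoothing parameter forced to 0
c(sum(residuals(b_pen)^2), sum(residuals(b_nopen)^2))  # RSS shrinks without penalty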


However, if you fit a GAM with all smoothing parameters fixed at 0, the model becomes an ordinary linear model or GLM, and outlier detection based on such a fit is reasonable.

If I were you, I might do the following (a code sketch follows the list):

  • fit a GAM with all smoothing parameters estimated by mgcv;
  • check the effective degrees of freedom (edf) of each smooth in the resulting model and round each up to the next integer. For example, if a smooth term s(x) has an edf of 13.2, round it up to 14;
  • fit a new GAM without penalization, by setting fx = TRUE in all s() or te() terms. However, now set k, the basis dimension, to the integer from the previous step plus one. Continuing the example, we want s(x, k = 15, fx = TRUE);
  • this unpenalized model is suitable for outlier detection.
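Here is a minimal sketch of that workflow on the question's data (variable names are illustrative; the pmax() guard against a too-small basis dimension is my addition):

library(mgcv)
set.seed(11)
d1 <- data.frame(x1 = c(100, rnorm(100, 5, 1)),
                 x2 = c(runif(100, 0, 100), 300))
d1$y <- d1$x1 * d1$x2 * rnorm(101, 50, 5)

fit <- gam(y ~ s(x1) + s(x2), data = d1)         # step 1: penalized fit
k_new <- pmax(ceiling(summary(fit)$edf) + 1, 3)  # step 2: round edf up, add 1
                                                 #         (pmax is an assumed safeguard)
fit0 <- gam(y ~ s(x1, k = k_new[1], fx = TRUE) + # step 3: unpenalized refit
              s(x2, k = k_new[2], fx = TRUE),
            data = d1)
lev <- influence(fit0)                           # step 4: GLM-like leverages
which(lev > 0.5)  # 0.5 is a rule of thumb, not a hard rule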
Zheyuan Li
  • That's a very useful answer, thanks for sharing. The question remains though: if we run cooks.distance(fit) where fit is a gam() model, we get a vector of values. Are these reliable values? – Angelo Apr 05 '21 at 15:59