I asked a statistician how I can identify and remove points that have too much leverage on GAM (generalized additive model) fits.
He told me that I can do this based on the influence/projection/hat matrix. I have also seen that @Gavin Simpson had the same idea.
Unfortunately, I do not know how to do this in practice in R. I can extract the influence matrix of a GAM via the influence.gam
function, but I don't know how to connect the influence matrix to the raw data to find out which observations should be removed.
Does anyone know how to remove data points with too much leverage based on the influence matrix? Is there a function that works for GAMs that have been fitted with gamm4?
Example code:
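library(mgcv)
set.seed(11)
# 101 observations; row 1 has an extreme x1, row 101 an extreme x2
x1 = c(100, rnorm(100, 5, 1))
x2 = c(runif(100, 0, 100), 300)
y = x1 * x2 * rnorm(101, 50, 5)
d1 = data.frame(y, x1, x2)
# purely parametric model (no smooth terms)
mod = gam(y ~ x1 * x2, data = d1)
# one value per observation
inf = influence.gam(mod)
hist(inf)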
EDIT: Thanks for your answer, 李哲源 Zheyuan Li. I realized I completely forgot to include s() smooths. I still don't really understand what influence.gam actually returns. Are those Cook's distances, leverages, or neither? Is it appropriate to delete all observations with values above 0.5? And is it appropriate to apply the same procedure to a gamm object (influence.gam(gamm_model$gam))?
mod1 = gam(y ~ s(x1) + s(x2), data = d1)
inf1 = influence.gam(mod1)
hist(inf1)
range(inf1) # at least in this example, all values lie between 0 and 1
mod2 = gam(y ~ s(x1, k = 8, fx = TRUE) + s(x2, k = 3, fx = TRUE), data = d1)
summary(mod2)
inf2 = influence.gam(mod2)
hist(inf2)
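# attach the values to the data and keep only rows below the 0.5 cutoff
d1$inf2 = inf2
d2 = subset(d1, inf2 < 0.5)
# refit with penalized smooths on the reduced data
mod3 = gam(y ~ s(x1) + s(x2), data = d2)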
summary(mod3)
plot(mod3)
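
To make the step I am unsure about concrete, here is a minimal sketch of how the values might be mapped back to rows of the raw data. It assumes that influence.gam returns the diagonal of the hat matrix (as its help page describes), with one value per row of the data in the original order. The cutoff of twice the mean leverage is only a hypothetical rule of thumb that I chose for illustration, not something the statistician recommended.

```r
library(mgcv)

set.seed(11)
x1 = c(100, rnorm(100, 5, 1))
x2 = c(runif(100, 0, 100), 300)
y = x1 * x2 * rnorm(101, 50, 5)
d1 = data.frame(y, x1, x2)

mod1 = gam(y ~ s(x1) + s(x2), data = d1)

# Assumption: one leverage value per observation, in the row order of d1
lev = influence.gam(mod1)
length(lev) == nrow(d1)

# If these are hat-matrix diagonals, their sum should match the
# total effective degrees of freedom of the fit
all.equal(sum(lev), sum(mod1$edf))

# Hypothetical cutoff: flag rows with more than twice the mean leverage
cutoff = 2 * mean(lev)
flagged = which(lev > cutoff)
d1[flagged, ]  # inspect the flagged raw observations

# Drop the flagged rows by logical index (safe even if nothing is flagged)
d2 = d1[!(seq_len(nrow(d1)) %in% flagged), ]
```

Indexing by position only works if no rows were dropped during fitting (for example because of missing values), so that would need checking on real data.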