I should say up front that my background in statistics is very basic (though I'm working on it). For work-related reasons I have to handle fairly large GAMs
that include both smooth terms and categorical factors. The models are fitted in R with gam on a data set of about 50,000 rows.
My goal is to identify outliers automatically: we receive about 50,000 new data points every day, so checking them manually is impossible.
I can't switch to a different model, so suggestions pointing to other approaches won't help me. My question is simple: how can I identify outliers? I know this is a huge topic, but I recently came across Cook's distance and influence.gam,
which seem to point me in the right direction.
I read this useful post: Remove data points with too much leverage on gam fit, as well as https://stats.stackexchange.com/questions/22161/how-to-read-cooks-distance-plots/22171#22171 My real question is: given a fitted gam model, can I rely on cooks.distance(fit) and influence.gam(fit) to spot outliers?
Let's say we have:
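library(mgcv)
set.seed(11)
# 101 observations; the first x1 value and the last x2 value are deliberately extreme
x1 = c(100, rnorm(100,5,1))
x2 = c(runif(100,0,100),300)
y = x1 * x2 * rnorm(101, 50,5)
d1 = data.frame(y,x1,x2)
# additive model with one smooth per predictor
mod1 = gam(y ~ s(x1) + s(x2), data = d1)
# leverage of each observation (diagonal of the influence/hat matrix)
inf1 = influence.gam(mod1)
hist(inf1)
# Cook's distance of each observation
hist(cooks.distance(mod1))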
Can I treat data points with an influence (leverage) value > 0.2 as points that at least require further investigation? And what about points with a Cook's distance >= 60?
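To make the question concrete, here is a rough sketch of the kind of automatic flagging I have in mind. The cutoffs are assumptions on my part: twice the mean leverage and 4/n for Cook's distance are rules of thumb I have seen for ordinary linear models, and I don't know whether they carry over to a GAM. The names lev_cut, cook_cut and flagged are just placeholders.

```r
library(mgcv)

# Same toy data as above, rebuilt so the sketch runs on its own
set.seed(11)
x1 <- c(100, rnorm(100, 5, 1))
x2 <- c(runif(100, 0, 100), 300)
y  <- x1 * x2 * rnorm(101, 50, 5)
d1 <- data.frame(y, x1, x2)
mod1 <- gam(y ~ s(x1) + s(x2), data = d1)

lev  <- influence.gam(mod1)    # leverage: diagonal of the hat matrix
cook <- cooks.distance(mod1)   # Cook's distance per observation

# ASSUMED cutoffs, borrowed from linear-model rules of thumb
lev_cut  <- 2 * mean(lev)      # twice the average leverage
cook_cut <- 4 / nrow(d1)       # 4 / n

# Row indices exceeding either cutoff; which() drops any NA comparisons
flagged <- which(lev > lev_cut | cook > cook_cut)

# Rows to inspect by hand, with their diagnostics attached
print(cbind(d1[flagged, ], leverage = lev[flagged], cooks = cook[flagged]))
```

On the real data the idea would be to review only the rows in flagged instead of all 50,000, but that is only worthwhile if cutoffs like these are defensible for a GAM.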
Thank you.