
I should say up front that my background in statistics is very basic (though I'm working on it). For work-related reasons I have to handle fairly large GAM models that include both smoothed terms and categorical factors. The GAM is fitted in R on a data set of about 50,000 rows.

My goal is to identify outliers: we get about 50,000 new data points every day, so identifying them manually is impossible. I can't switch to a different model, so I can't consider suggestions that point to other approaches. My question is simple: how can I identify outliers? This is certainly a huge topic, but I recently came across Cook's distance and influence.gam, which seem to point me in the right direction.

I read this useful post: "Remove data points with too much leverage on gam fit", as well as https://stats.stackexchange.com/questions/22161/how-to-read-cooks-distance-plots/22171#22171. My real question is: given a fitted gam model, can I rely on cooks.distance(fit) and influence.gam(fit) to spot outliers?

Let's say we have:

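library(mgcv)
set.seed(11)
# 101 rows; the first x1 (100) and the last x2 (300) lie far from the rest
x1 = c(100, rnorm(100,5,1))
x2 = c(runif(100,0,100),300)
y  = x1 * x2 * rnorm(101, 50,5)
d1 = data.frame(y,x1,x2)
mod1 = gam(y ~ s(x1) + s(x2), data = d1)
inf1 = influence.gam(mod1)   # leverages: diagonal of the hat (influence) matrix
hist(inf1)
hist(cooks.distance(mod1))   # Cook's distance for each row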

Can I treat data points with an influence.gam value > 0.2 as points that at least require further investigation? And what about points with a Cook's distance >= 60?
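To make the question concrete, here is a minimal sketch of what "flagging" could look like on the simulated data. The cutoffs used (leverage above 2p/n and Cook's distance above 4/n) are common rules of thumb and are assumptions here, not mgcv defaults or something the linked posts prescribe. The names `lev_cut`, `cd_cut` and `flagged` are purely illustrative.

```r
library(mgcv)

# Same simulated data as in the question: row 1 has an extreme x1,
# row 101 has an extreme x2.
set.seed(11)
x1 <- c(100, rnorm(100, 5, 1))
x2 <- c(runif(100, 0, 100), 300)
y  <- x1 * x2 * rnorm(101, 50, 5)
d1 <- data.frame(y, x1, x2)
mod1 <- gam(y ~ s(x1) + s(x2), data = d1)

lev <- influence.gam(mod1)    # leverages: diagonal of the hat matrix
cd  <- cooks.distance(mod1)   # Cook's distance for each row

n <- nrow(d1)
p <- sum(lev)                 # trace of the hat matrix (effective degrees of freedom)

# Rule-of-thumb cutoffs (assumptions, not mgcv defaults)
lev_cut <- 2 * p / n
cd_cut  <- 4 / n

# Keep rows that exceed either cutoff, largest Cook's distance first
flagged <- data.frame(row = seq_len(n), leverage = lev, cooks_d = cd)
flagged <- flagged[lev > lev_cut | cd > cd_cut, ]
flagged <- flagged[order(-flagged$cooks_d), ]
print(flagged)
```

Both quantities come from the single fitted model, so nothing is refitted here; whether these cutoffs are sensible for a penalized smooth fit is exactly the open question.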

Thank you.

Shawn Hemelstrand
Angelo
  • As a side note, it is interesting that in this post https://stats.stackexchange.com/questions/22161/how-to-read-cooks-distance-plots/22171#22171 someone suggested recomputing the influence by dropping each row individually. Even if that is correct, it would still be quite hard for me to run 50k gam regressions given the size of my data frame. – Angelo Apr 05 '21 at 16:37
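One way to keep the drop-one idea affordable is to refit only for a handful of candidate rows instead of all of them. The sketch below assumes the candidates are the `k` rows with the largest Cook's distance. Both `k` and the summary `rms_shift` (root mean squared change in fitted values at the remaining rows) are hypothetical choices for illustration, not an established diagnostic.

```r
library(mgcv)

# Same simulated data as in the question
set.seed(11)
x1 <- c(100, rnorm(100, 5, 1))
x2 <- c(runif(100, 0, 100), 300)
y  <- x1 * x2 * rnorm(101, 50, 5)
d1 <- data.frame(y, x1, x2)
mod1 <- gam(y ~ s(x1) + s(x2), data = d1)

# Hypothetical choice: refit only for the k rows with the largest Cook's distance
k <- 5
cd <- cooks.distance(mod1)
candidates <- order(cd, decreasing = TRUE)[seq_len(k)]

# For each candidate, refit without that row and measure how much the
# fitted values at the remaining rows change (root mean squared change).
# Smoothing parameters are re-estimated in every refit.
drop_one_shift <- sapply(candidates, function(i) {
  refit <- gam(y ~ s(x1) + s(x2), data = d1[-i, ])
  sqrt(mean((predict(refit, newdata = d1[-i, ]) - fitted(mod1)[-i])^2))
})

result <- data.frame(row = candidates,
                     cooks_d = cd[candidates],
                     rms_shift = drop_one_shift)
print(result)
```

This costs `k` refits rather than one per row, at the price of trusting the initial Cook's distance ranking to pick the candidates.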
