I'm fitting some GAMs with mgcv where I need a high k
value to capture complicated behavior in some cases. But I've noticed that sometimes (some datasets), fitting the GAM takes forever when k
is high, and it seems like in these cases, I also don't get convergence. I need to try fitting lots of datasets and can't wait three days when I stumble on one of these datasets where it can take forever and ultimately fail!
I encountered R.utils::withTimeout()
, which seemed like a promising tool to ensure I move on if I hit one of these gam time traps, but it's acting inconsistently for me. Below is a printout of running the same script three times consecutively. Note that the first of these three times, the timeout apparently did not happen. I vaguely understand that there are situations where withTimeout
is expected to fail... I'd like to know how I can enforce a hard stop on mgcv::gam
, potentially with a different tool (I saw a reference to the package processx).
This looks like a lot but it's exact repetitions; session info below as well. Key point from the code below is that withTimeout
apparently did nothing the first time (took 30 seconds) and then kicks in each successive time (8 seconds seems close enough to 6).
Thanks!
> tictoc::tic()
> set.seed(2) ## simulate some data...
> dat <- mgcv::gamSim(1
+ , n = 5000
+ , dist = "normal"
+ , scale = 2)
Gu & Wahba 4 term additive model
> b <- mgcv::gam(y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, k = myk)
+ , data = dat)
> summary(b)
Family: gaussian
Link function: identity
Formula:
y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3,
k = myk)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.84735 0.02869 273.5 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x0) 5.057 6.321 87.162 <2e-16 ***
s(x1) 3.739 4.671 813.470 <2e-16 ***
s(x2) 20.716 25.865 354.052 <2e-16 ***
s(x3) 3.298 4.121 1.496 0.197
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.728 Deviance explained = 73%
GCV = 4.1439 Scale est. = 4.1158 n = 5000
> tictoc::toc()
30.418 sec elapsed
> # just did this script twice; this part takes me 30 seconds
>
> tmax <- 6
>
> tictoc::tic()
> gf <- tryCatch({
+ res <- R.utils::withTimeout({
+ set.seed(2) ## simulate some data...
+ dat <- mgcv::gamSim(1
+ , n = 5000
+ , dist = "normal"
+ , scale = 2)
+ b <- mgcv::gam(y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, k = myk)
+ , data = dat)
+ summary(b)
+ }, timeout = tmax)
+ }, TimeoutException = function(ex) {
+ message(paste("Timeout before gam fit complete (should take", tmax, "seconds)"))
+
+ })
Gu & Wahba 4 term additive model
> tictoc::toc()
30.332 sec elapsed
> myk <- 120
>
> tictoc::tic()
> set.seed(2) ## simulate some data...
> dat <- mgcv::gamSim(1
+ , n = 5000
+ , dist = "normal"
+ , scale = 2)
Gu & Wahba 4 term additive model
> b <- mgcv::gam(y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, k = myk)
+ , data = dat)
> summary(b)
Family: gaussian
Link function: identity
Formula:
y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3,
k = myk)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.84735 0.02869 273.5 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x0) 5.057 6.321 87.162 <2e-16 ***
s(x1) 3.739 4.671 813.470 <2e-16 ***
s(x2) 20.716 25.865 354.052 <2e-16 ***
s(x3) 3.298 4.121 1.496 0.197
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.728 Deviance explained = 73%
GCV = 4.1439 Scale est. = 4.1158 n = 5000
> tictoc::toc()
30.415 sec elapsed
> # just did this script twice; this part takes me 30 seconds
>
> # I expect the next part to take 6 seconds based on tmax.
> # dpeending on n and k, it either does, or it can take WAYYY longer. How can I
> # guarantee 6 seconds.
> tmax <- 6
>
> tictoc::tic()
> gf <- tryCatch({
+ res <- R.utils::withTimeout({
+ set.seed(2) ## simulate some data...
+ dat <- mgcv::gamSim(1
+ , n = 5000
+ , dist = "normal"
+ , scale = 2)
+ b <- mgcv::gam(y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, k = myk)
+ , data = dat)
+ summary(b)
+ }, timeout = tmax)
+ }, TimeoutException = function(ex) {
+ message(paste("Timeout before gam fit complete (should take", tmax, "seconds)"))
+
+ })
Gu & Wahba 4 term additive model
Timeout before gam fit complete (should take 6 seconds)
> tictoc::toc()
8.112 sec elapsed
> # Test timeout with mgcv::gam
> myk <- 120
>
> tictoc::tic()
> set.seed(2) ## simulate some data...
> dat <- mgcv::gamSim(1
+ , n = 5000
+ , dist = "normal"
+ , scale = 2)
Gu & Wahba 4 term additive model
> b <- mgcv::gam(y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, k = myk)
+ , data = dat)
> summary(b)
Family: gaussian
Link function: identity
Formula:
y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3,
k = myk)
Parametric coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 7.84735 0.02869 273.5 <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
edf Ref.df F p-value
s(x0) 5.057 6.321 87.162 <2e-16 ***
s(x1) 3.739 4.671 813.470 <2e-16 ***
s(x2) 20.716 25.865 354.052 <2e-16 ***
s(x3) 3.298 4.121 1.496 0.197
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.728 Deviance explained = 73%
GCV = 4.1439 Scale est. = 4.1158 n = 5000
> tictoc::toc()
30.318 sec elapsed
> # just did this script twice; this part takes me 30 seconds
>
> # I expect the next part to take 6 seconds based on tmax.
> # dpeending on n and k, it either does, or it can take WAYYY longer. How can I
> # guarantee 6 seconds.
> tmax <- 6
>
> tictoc::tic()
> gf <- tryCatch({
+ res <- R.utils::withTimeout({
+ set.seed(2) ## simulate some data...
+ dat <- mgcv::gamSim(1
+ , n = 5000
+ , dist = "normal"
+ , scale = 2)
+ b <- mgcv::gam(y ~ s(x0, k = myk) + s(x1, k = myk) + s(x2, k = myk) + s(x3, k = myk)
+ , data = dat)
+ summary(b)
+ }, timeout = tmax)
+ }, TimeoutException = function(ex) {
+ message(paste("Timeout before gam fit complete (should take", tmax, "seconds)"))
+
+ })
Gu & Wahba 4 term additive model
Timeout before gam fit complete (should take 6 seconds)
> tictoc::toc()
8.04 sec elapsed
> sessionInfo()
R version 4.2.3 (2023-03-15)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Ventura 13.2.1
Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] Rcpp_1.0.10 compiler_4.2.3 pillar_1.8.1 R.methodsS3_1.8.2 R.utils_2.12.2 tools_4.2.3 mvnfast_0.2.8 digest_0.6.31 evaluate_0.20 lubridate_1.9.2
[11] lifecycle_1.0.3 tibble_3.2.1 nlme_3.1-162 gtable_0.3.2 lattice_0.20-45 timechange_0.2.0 mgcv_1.8-42 pkgconfig_2.0.3 rlang_1.1.0 Matrix_1.5-3
[21] cli_3.6.0 rstudioapi_0.14 patchwork_1.1.2.9000 yaml_2.3.7 parallel_4.2.3 xfun_0.37 fastmap_1.1.1 gratia_0.8.1 knitr_1.42 stringr_1.5.0
[31] furrr_0.3.1 dplyr_1.1.0 generics_0.1.3 vctrs_0.6.0 globals_0.16.2 tictoc_1.1 grid_4.2.3 tidyselect_1.2.0 glue_1.6.2 listenv_0.9.0
[41] R6_2.5.1 fansi_1.0.4 parallelly_1.34.0 rmarkdown_2.20 tidyr_1.3.0 purrr_1.0.1 ggplot2_3.4.1 magrittr_2.0.3 htmltools_0.5.4 scales_1.2.1
[51] codetools_0.2-19 splines_4.2.3 colorspace_2.1-0 future_1.32.0 utf8_1.2.3 stringi_1.7.12 munsell_0.5.0 R.oo_1.25.0