
I am trying to fit a logistic regression to a small data set (17k rows, 16 columns), but it was still running after 60+ minutes, so I killed the process. Neither my CPU nor my RAM is maxed out; I just see higher utilization once the fitting starts. To rule out an egregious coding error, I ran the same code on a 5-row by 16-column subset. That worked: I was able to get a summary and confidence intervals. Hence, there must be another issue.

The data set has a mixture of factor, integer and numeric variables. I'd like to share its schema, but it contains sensitive, proprietary information.
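Without exposing the schema, the mix of column types can be summarized with a one-liner in base R (a sketch; `design_mat_final` is the data set shown below):

# Count columns by their (first) class, e.g. factor / integer / numeric.
table(vapply(design_mat_final, function(x) class(x)[1], character(1)))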

I'm wondering what solutions might be suggested, and whether the fixes posited in the half-decade-old posts linked below are still relevant (I am trying those old solutions now).

The data set dimensions and the code:

> dim(design_mat_final)
[1] 16812    16

log_model <- glm(label ~., 
                 family = binomial(link = 'logit'),
                 data = design_mat_final)
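To see how the run time scales before committing to the full data, one option (a sketch using only base R; it assumes the rows are in no meaningful order) is to time the fit on growing row subsets:

# Time the fit on increasing subsets; a sudden blow-up points at the data
# rather than at glm itself.
for (n in c(1000, 2000, 4000, 8000)) {
  t <- system.time(
    glm(label ~ ., family = binomial(link = 'logit'),
        data = design_mat_final[1:n, ])
  )
  print(c(n = n, elapsed = t[['elapsed']]))
}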

My session info:

> sessionInfo()
R version 3.4.3 (2017-11-30)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.2        dplyr_0.7.4         bit64_0.9-7         bit_1.1-12          data.table_1.10.4-3

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.15     utf8_1.1.3       crayon_1.3.4     assertthat_0.2.0 R6_2.2.2         magrittr_1.5    
 [7] pillar_1.1.0     cli_1.0.0        rlang_0.1.6      tools_3.4.3      glue_1.2.0       yaml_2.1.16     
[13] compiler_3.4.3   pkgconfig_2.0.1  knitr_1.20       bindr_0.1        tibble_1.4.2    

Related to this 5-year-old post: How to speed up GLM estimation in r?

and relevant to this 6-year-old CrossValidated post: https://stats.stackexchange.com/questions/26965/logistic-regression-is-slow

Update:

I tried `speedglm`, and it did not have an appreciable effect.
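For reference, the `speedglm` call was the drop-in equivalent of the `glm` call above (a sketch; `speedglm()` comes from the CRAN package of the same name):

library(speedglm)

log_model_sp <- speedglm(label ~ .,
                         family = binomial(link = 'logit'),
                         data = design_mat_final)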

user2205916
  • Is your data numeric or categorical? – HXSP1947 Feb 21 '18 at 05:35
  • Both data types are present. – user2205916 Feb 21 '18 at 05:38
  • 1
    My initial thought is to remove the categorical data and see if it speeds up. Categorical data could potentially be slowing things down if there is good deal of dummy encoding going on. In previous work that I have done I was using logistic regresssion on a relatively small data set (10000 samples or so) with 15 predictors. All of our features were categorical though which after dummy encoding gave 150+ features. Things ran a lot slower than expected because of this (large matrices for ordinary least squares) – HXSP1947 Feb 21 '18 at 05:45
  • 1
    To test the above (@HXSP1947), you could try to run `m<-model.matrix(label~., data = design_mat_final)`. Check how long that takes and check the dimensions of the model matrix. – Jan van der Laan Feb 21 '18 at 06:43
  • Run something like `lapply(data, function(x) if (!is.numeric(x)) as.numeric(as.factor(x)) else x)` to convert non-numeric columns to numeric. Try a matrix rather than a data frame (which is actually a list): use `as.matrix(data)`. – jay.sf Feb 21 '18 at 10:56
  • 1
    @HXSP1947 that actually ended up being one of what turned out to be primary problem. There are a large number of unbalanced factor variables in my dataset and when these unbalanced categorical variables are turned into dummy variables, there is a high probability of attaining one column that is a linear combination of another. Since the default imputation methods involve linear regression, this resulted in a X matrix that cannot be inverted. – user2205916 Feb 21 '18 at 19:17
  • @HXSP1947 Trying a smaller subset of my original `data.table` helped to debug the problem, because I immediately got an error, `system is computationally singular . . .`, which led me to the realization shared in my comment above. As a side note, I also tried `speedglm`, which had no appreciable effect on the original `data.table` I was working with. – user2205916 Feb 21 '18 at 19:18
  • 1
    @user2205916 show the dimensions of `x <- model.matrix( ~ .,data = design_mat_final)`. If the categorical predictors have a large number of values, the information matrix could be nearly singular. Also, have you tried `glm.fit`? – AdamO Feb 21 '18 at 19:51
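Pulling the comment suggestions together, here is a minimal diagnostic sketch: build the dummy-encoded design matrix once, check its size, test it for rank deficiency (the linearly dependent dummy columns behind the `system is computationally singular` error), and fit directly with `glm.fit` as AdamO suggests. `design_mat_final` and `label` are the objects from the question.

# Build the expanded design matrix once and inspect it.
X <- model.matrix(label ~ ., data = design_mat_final)
dim(X)  # how many columns after dummy encoding?

# Rank deficiency means some columns are linear combinations of others.
qr_X <- qr(X)
if (qr_X$rank < ncol(X)) {
  # Names of the redundant columns flagged by QR pivoting.
  print(colnames(X)[qr_X$pivot[(qr_X$rank + 1):ncol(X)]])
}

# glm.fit skips the formula/model-frame machinery and works on X directly;
# binomial()'s initialize handles a factor response.
fit <- glm.fit(x = X, y = design_mat_final$label, family = binomial())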
