I'm trying to estimate a logit model using the glm function in R. My data set has about 40,000 observations (mayoral candidates in cities), and I'm trying to include as a control a factor with about 1,800 levels. I stopped the estimation after 10 minutes, but I'm not sure whether it will take minutes, hours, days, weeks, or longer to finish. Is there any way to estimate how long it will take?
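For concreteness, the call looks roughly like this (the data frame and column names are placeholders, not my actual data):

```r
# Hypothetical data frame and column names; the real model has ~40,000 rows
# and a city factor with ~1,800 levels.
fit <- glm(won ~ incumbent + spending + factor(city),
           data = candidates,
           family = binomial(link = "logit"))
```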
1 Answer
Converting my comments to an answer:
There's not really a way to pre-compute the time; it will depend on a lot of factors, including the computer you're running it on. You could use the control parameters to set trace = TRUE, which will print output at every iteration. The default is a maximum of 25 iterations, so monitoring that output as it runs will give you a sense of how quickly things are moving.
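As a sketch, turning on the trace looks something like this (formula and data names are placeholders):

```r
# trace = TRUE prints the deviance at each IRLS iteration as glm runs;
# maxit = 25 is glm's default cap on the number of iterations.
fit <- glm(won ~ incumbent + factor(city),
           data = candidates,
           family = binomial(),
           control = glm.control(trace = TRUE, maxit = 25))
```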
You could run your model on increasing subsets of your data to see how it scales: 5,000 rows with 200 levels of your factor, then 10,000 rows with 400 levels, and so on. Doing this four or five times should give you a decent sense. Don't expect the growth in time to be linear.
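A rough sketch of that timing exercise, assuming the city variable is stored as a factor and sampling whole cities so the number of levels grows along with the rows (names are placeholders):

```r
# Time a fit on a subset of cities so both the row count and the number of
# factor levels grow together.
time_fit <- function(n_cities) {
  keep <- sample(levels(candidates$city), n_cities)
  sub  <- candidates[candidates$city %in% keep, , drop = FALSE]
  sub$city <- droplevels(sub$city)
  system.time(
    glm(won ~ incumbent + city, data = sub, family = binomial())
  )["elapsed"]
}

# Fit at 200, 400, and 800 levels and eyeball how the cost grows.
sapply(c(200, 400, 800), time_fit)
```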
A better use of your time may be finding ways to speed up the estimation. With that many factor levels, a sparse model matrix will certainly help. The fastglm package looks quite nice (though I've never used it), and this question has several answers with ideas for speeding up glm estimation.
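As a hedged sketch of the fastglm route (I haven't run this; fastglm's matrix interface and its handling of sparse designs are assumptions to verify against its documentation):

```r
library(fastglm)

# Build the design matrix yourself; with ~1,800 dummy columns a sparse
# design (Matrix::sparse.model.matrix) saves memory, though fastglm may
# expect a dense matrix -- check its documentation before relying on this.
X <- model.matrix(won ~ incumbent + factor(city), data = candidates)
y <- candidates$won

# Fit the logit via fastglm's matrix interface (an assumption to verify).
fit <- fastglm(x = X, y = y, family = binomial())
```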
