Limit on the Number of Variables for bestglm Function

Question

I am trying to run the bestglm function in R for subset selection and the run fails immediately if I use more than 15 variables in the function. I attached some sample code below (I know these models have far too many variables for this dataset, I am just including these models here as an example):

cars.df = data.frame(mtcars)
cars.df
resp.var = cars.df$mpg

ind.matrix.15 = model.matrix(mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb + disp:wt + drat:wt + qsec:am + gear:hp + cyl:disp + drat:gear, data = cars.df)[, -1]
matrix.xy.15 = data.frame(ind.matrix.15, y = as.matrix(resp.var))
bestglm(Xy = matrix.xy.15, family = gaussian(link = 'log'), nvmax = 15)

ind.matrix.16 = model.matrix(mpg ~ disp + hp + drat + wt + qsec + vs + am + gear + carb + disp:wt + drat:wt + qsec:am + gear:hp + cyl:disp + drat:gear + disp:hp, data = cars.df)[, -1]
matrix.xy.16 = data.frame(ind.matrix.16, y = as.matrix(resp.var))
bestglm(Xy = matrix.xy.16, family = gaussian(link = 'log'), nvmax = 16)

The first bestglm function runs fine, but when I add an additional variable for a total of 16 features, the second bestglm function instantly produces this error message: p = 16. must be <= 15 for GLM.

Changing the method argument to a simpler algorithm such as backward rather than the default exhaustive does not make the error go away.

Is this just a limitation of the bestglm function, or is there an argument I can change to allow more than 15 features.

The error message indicates you can fit 15 terms to `bestglm` ([here](https://github.com/cran/bestglm/blob/9a9e26a7fc41dcb3049f0a23d31812faea38e8e2/R/bestglm.R#L196)). I wonder what would changing `RequireFullEnumerationQ` do... — Roman Luštrik, Nov 15 '20 at 21:16
Interesting, thank you for linking the code. It looks like the code defaults to the leaps library if the `family = gaussian`, there are no factors with more than 2 levels, and `RequireFullEnumerationQ = FALSE` (default). If any of those are false, it looks like the function requires 15 or less variables, regardless if `method != exhaustive`. This is rather unfortunate, but I guess it is just a limitation of the function. — eschu, Nov 15 '20 at 22:07

Ben Bolker · Accepted Answer · 2020-11-15T22:15:16.170

As @RomanLuštrik says, this is a hard-coded constraint in bestglm, presumably because 15 predictors means there are 2^15 = 32768 candidate models, and one has to stop somewhere ... as far as I can see there is no way around this constraint when running a GLM. (Roman's suggestion of RequireFullEnumerationQ=FALSE doesn't work, because the leaps-and-bound algorithm is only available for linear models, not GLMs.)

One possible strategy (not fully explored here) would be to fit the linear model exhaustively with leaps-and-bounds, save a large number of the top models (say TopModels=1000) and then re-evaluate the top models with your preferred variance structure ... this doesn't work directly in leaps, but can be hacked as follows:

leaps.obj <- leaps:::leaps.setup(matrix.xy.16,y=cars.df$mpg,nvmax=16,
       nbest=10000)
bb <- leaps:::leaps.exhaustive(leaps.obj, really.big=TRUE)

but I don't know (and it seems like a lot of work) to figure out how to re-evaluate these models with a log-link Gaussian.

You might be able to get the glmulti package to work (it offers both method="h" for full enumeration and method="g" for a genetic algorithm), but so far I haven't managed to overcome some Java errors ...

Unfortunately, the J Stat Software article describing glmulti shows that this method has some of the same constraints:

For performance, the Java classes encode formulas as compact bit strings. Currently two integers (32 bits each) are used for main effects, and two long integers (128 bits) are used for each category of interaction terms (factor:factor, covariate:covariate, and factor:covariate),to encode models. This means that there can be at most 32 factors and 32 covariates, and, if including interactions, at most 128 interactions of each category. The latter constraint necessitates that, if x is the number of factors and y the number of covariates:
x <16
y <16
xy <128

Limit on the Number of Variables for bestglm Function

1 Answers1