
I'm trying to run a logistic regression in statsmodels on a large design matrix (~200 columns). The features include a number of interactions, categorical features, and semi-sparse (70% zeros) integer features. Although my design matrix is not actually ill-conditioned, it seems to be close (according to `numpy.linalg.matrix_rank`, it is full-rank with `tol=1e-3` but not with `tol=1e-2`). As a result, I'm struggling to get logistic regression to converge with any of the methods in statsmodels. Here's what I've tried so far (a sketch reproducing these attempts on synthetic data follows the list):

  • `method='newton'`: did not converge after 1000 iterations, then raised a singular-matrix `LinAlgError` while trying to invert the Hessian.

  • `method='bfgs'`: warned of possible precision loss, then claimed convergence after 0 iterations; it obviously had not actually converged.

  • `method='nm'`: claimed convergence, but the model had a negative pseudo-R-squared and many coefficients were still zero (and very different from the values they had converged to in better-conditioned submodels). Tightening `xtol` to 1e-8 didn't help.

  • `fit_regularized(method='l1')`: reported "Inequality constraints incompatible" (exit mode 4), then raised a singular-matrix `LinAlgError` while trying to compute the restricted Hessian inverse.
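For concreteness, here's a minimal sketch of the rank check and the fit attempts above. Since the real data is proprietary, it uses synthetic data with one nearly duplicated column to stand in for the borderline conditioning; the sizes, seed, and `alpha` value are arbitrary, and it may not reproduce the exact failures:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic stand-in for the proprietary data: a nearly duplicated column
# mimics the borderline conditioning, where the rank check flips between
# tol=1e-3 and tol=1e-2.
rng = np.random.default_rng(0)
n = 5000
X = rng.standard_normal((n, 10))
X[:, -1] = X[:, -2] + 1e-4 * rng.standard_normal(n)  # near-collinear pair
X = sm.add_constant(X)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X[:, 1]))).astype(int)

print(np.linalg.matrix_rank(X, tol=1e-3))  # full rank at tol=1e-3
print(np.linalg.matrix_rank(X, tol=1e-2))  # rank-deficient at tol=1e-2
print(np.linalg.cond(X))                   # condition number as a sanity check

model = sm.Logit(y, X)
for method, kwargs in [('newton', {'maxiter': 1000}),
                       ('bfgs', {}),
                       ('nm', {'maxiter': 5000, 'xtol': 1e-8})]:
    try:
        res = model.fit(method=method, **kwargs)
        print(method, res.mle_retvals.get('converged'), res.prsquared)
    except np.linalg.LinAlgError as err:  # singular Hessian on the real data
        print(method, 'raised:', err)

try:
    model.fit_regularized(method='l1', alpha=1.0)  # exit mode 4 on the real data
except np.linalg.LinAlgError as err:
    print('l1 raised:', err)
```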

Ben Kuhn
  • Can you share your data somewhere? – jseabold Dec 12 '14 at 03:46
  • Alas, no; it's proprietary. – Ben Kuhn Dec 12 '14 at 04:45
  • I found that standardizing the data helped with the convergence issues (sketched after this thread). It's an OK solution: I can't use formulas with it (because centering each column in the formula is a pain) and it makes the coefficients harder to interpret, but it at least gets the fit to converge. – Ben Kuhn Dec 12 '14 at 04:49
  • The above "solution" also still failed when I added an 18-level categorical feature. There's some chance that this was due to actual collinearity, although I doubt it. I'll try to create example (random) data that exhibits the problem tomorrow. – Ben Kuhn Dec 12 '14 at 04:51
  • Not wholly surprised. We have some code to do this internally, but it's not hooked up by default yet. – jseabold Dec 12 '14 at 04:51
  • Ah, excellent! That would make life a lot easier. Thanks for your great work on statsmodels; I know I'm asking a lot of it! – Ben Kuhn Dec 12 '14 at 04:52
  • Do you use the parameters of the smaller model as starting values for the larger model when you add variables/terms? – Josef Dec 12 '14 at 14:06
  • It would be helpful if you have a ready example that fails and standardization makes it work. We could add it to the documentation. – jseabold Dec 15 '14 at 00:20
  • I tried to reproduce the problem with publicly available code and data. I didn't get all the way there in the time I allotted, but I at least managed to reproduce some parts of it [here](https://dl.dropboxusercontent.com/u/1439158/Statsmodels%20test%20case.ipynb). Namely, I found that when using a spline basis, the matrix appeared to be full-rank only at very low tolerances, and that logistic regression appeared to converge but gave meaningless confidence intervals and p-values. However, standardization didn't make this one work any better. – Ben Kuhn Dec 15 '14 at 21:00
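A hedged sketch of the two workarounds discussed in the comments: standardize the non-constant columns, then warm-start the full model from a better-conditioned submodel via `start_params`. It continues the synthetic `X`, `y` from the sketch above, and the five-column split is an arbitrary placeholder:

```python
# Standardize everything except the intercept column (column 0).
X_std = X.copy()
X_std[:, 1:] = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)

# Fit a smaller, well-conditioned submodel first, then use its parameters
# (padded with zeros for the added terms) as starting values for the full model.
small = sm.Logit(y, X_std[:, :5]).fit(method='newton')
start = np.r_[small.params, np.zeros(X_std.shape[1] - 5)]
full = sm.Logit(y, X_std).fit(start_params=start, method='bfgs')
print(full.mle_retvals.get('converged'), full.prsquared)
```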

0 Answers