
I'm trying to run a logistic regression in statsmodels on a large design matrix (~200 columns). The features include a number of interactions, categorical features, and semi-sparse (70% zeros) integer features. Although my design matrix is not actually ill-conditioned, it seems to be close (according to `numpy.linalg.matrix_rank`, it is full-rank with `tol=1e-3` but not with `tol=1e-2`). As a result, I'm struggling to get logistic regression to converge with any of the methods in statsmodels. Here's what I've tried so far (a sketch reproducing these attempts on synthetic data follows the list):

  • `method='newton'`: did not converge after 1000 iterations, then raised a singular-matrix `LinAlgError` while trying to invert the Hessian.

  • `method='bfgs'`: warned of possible precision loss, then claimed convergence after 0 iterations; it obviously had not actually converged.

  • `method='nm'`: claimed convergence, but the model had a negative pseudo-R-squared and many coefficients were still zero (and very different from the values they had converged to in better-conditioned submodels). Tightening `xtol` to 1e-8 didn't help.

  • `fit_regularized(method='l1')`: reported "Inequality constraints incompatible" (exit mode 4), then raised a singular-matrix `LinAlgError` while trying to compute the restricted Hessian inverse.
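For concreteness, here's a minimal sketch of the rank check and the fit attempts above. Since the real data is proprietary, it uses synthetic data with one nearly duplicated column to stand in for the borderline conditioning; the sizes, seed, and `alpha` value are arbitrary, and it may not reproduce the exact failures:

```python
import numpy as np
import statsmodels.api as sm

# Synthetic stand-in for the proprietary data: a nearly duplicated column
# mimics the borderline conditioning, where the rank check flips between
# tol=1e-3 and tol=1e-2.
rng = np.random.default_rng(0)
n = 5000
X = rng.standard_normal((n, 10))
X[:, -1] = X[:, -2] + 1e-4 * rng.standard_normal(n)  # near-collinear pair
X = sm.add_constant(X)
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X[:, 1]))).astype(int)

print(np.linalg.matrix_rank(X, tol=1e-3))  # full rank at tol=1e-3
print(np.linalg.matrix_rank(X, tol=1e-2))  # rank-deficient at tol=1e-2
print(np.linalg.cond(X))                   # condition number as a sanity check

model = sm.Logit(y, X)
for method, kwargs in [('newton', {'maxiter': 1000}),
                       ('bfgs', {}),
                       ('nm', {'maxiter': 5000, 'xtol': 1e-8})]:
    try:
        res = model.fit(method=method, **kwargs)
        print(method, res.mle_retvals.get('converged'), res.prsquared)
    except np.linalg.LinAlgError as err:  # singular Hessian on the real data
        print(method, 'raised:', err)

try:
    model.fit_regularized(method='l1', alpha=1.0)  # exit mode 4 on the real data
except np.linalg.LinAlgError as err:
    print('l1 raised:', err)
```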

Ben Kuhn
  • Can you share your data somewhere? – jseabold Dec 12 '14 at 03:46
  • Alas, no; it's proprietary. – Ben Kuhn Dec 12 '14 at 04:45
  • I found that standardizing the data helped with the convergence issues (sketched after this thread). It's an OK solution: I can't use formulas with it (because centering each column in the formula is a pain) and it makes the coefficients harder to interpret, but it at least gets the fit to converge. – Ben Kuhn Dec 12 '14 at 04:49
  • The above "solution" also still failed when I added an 18-level categorical feature. There's some chance that this was due to actual collinearity, although I doubt it. I'll try to create example (random) data that exhibits the problem tomorrow. – Ben Kuhn Dec 12 '14 at 04:51
  • Not wholly surprised. We have some code to do this internally, but it's not hooked up by default yet. – jseabold Dec 12 '14 at 04:51
  • Ah, excellent! That would make life a lot easier. Thanks for your great work on statsmodels; I know I'm asking a lot of it! – Ben Kuhn Dec 12 '14 at 04:52
  • Do you use the parameters of the smaller model as starting values for the larger model when you add variables/terms? – Josef Dec 12 '14 at 14:06
  • It would be helpful if you have a ready example that fails and standardization makes it work. We could add it to the documentation. – jseabold Dec 15 '14 at 00:20
  • I tried to reproduce the problem with publicly available code and data. I didn't get all the way there in the time I allotted, but I at least managed to reproduce some parts of it [here](https://dl.dropboxusercontent.com/u/1439158/Statsmodels%20test%20case.ipynb). Namely, I found that when using a spline basis, the matrix appeared to be full-rank only at very low tolerances, and that logistic regression appeared to converge but gave meaningless confidence intervals and p-values. However, standardization didn't make this one work any better. – Ben Kuhn Dec 15 '14 at 21:00
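A hedged sketch of the two workarounds discussed in the comments: standardize the non-constant columns, then warm-start the full model from a better-conditioned submodel via `start_params`. It continues the synthetic `X`, `y` from the sketch above, and the five-column split is an arbitrary placeholder:

```python
# Standardize everything except the intercept column (column 0).
X_std = X.copy()
X_std[:, 1:] = (X[:, 1:] - X[:, 1:].mean(axis=0)) / X[:, 1:].std(axis=0)

# Fit a smaller, well-conditioned submodel first, then use its parameters
# (padded with zeros for the added terms) as starting values for the full model.
small = sm.Logit(y, X_std[:, :5]).fit(method='newton')
start = np.r_[small.params, np.zeros(X_std.shape[1] - 5)]
full = sm.Logit(y, X_std).fit(start_params=start, method='bfgs')
print(full.mle_retvals.get('converged'), full.prsquared)
```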

0 Answers