
I am trying to investigate the distribution of maximum likelihood estimates, specifically for a large number of covariates p in a high-dimensional regime (meaning that p/n, with n the sample size, is about 1/5). I am generating the data and then using statsmodels.api.Logit to fit the parameters of my model.
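
Roughly, my setup looks like the sketch below (the covariate distribution and coefficient scale here are illustrative placeholders rather than my exact code):

import numpy as np
import statsmodels.api as sm

# Illustrative data-generating sketch; the exact distributions are placeholders.
rng = np.random.default_rng(0)
n, p = 4000, 800                          # p/n = 1/5
X = rng.normal(0.0, 1.0 / np.sqrt(p), size=(n, p))
beta = rng.normal(0.0, 1.0, size=p)
probs = 1.0 / (1.0 + np.exp(-X @ beta))   # logistic link
y = rng.binomial(1, probs)

model = sm.Logit(y, X)
result = model.fit()                      # default solver; this is where it diverges for me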

The problem is that this only seems to work in a low-dimensional regime (say 300 covariates and 40,000 observations). Specifically, I get a warning that the maximum number of iterations has been reached, a log-likelihood of inf (i.e. it has diverged), and a 'singular matrix' error.

I am not sure how to remedy this. Initially, when I was still working with smaller values (say 80 covariates and 4000 observations) and got this error occasionally, I raised the maximum number of iterations from 35 to 70, which seemed to help.
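
Concretely, that was just a matter of passing a larger maxiter to the fit call:

result = model.fit(maxiter=70)   # default cap is 35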

However, that clearly will not help now, because my log-likelihood function is diverging; it is not just a matter of non-convergence within the maximum number of iterations.

It would be easy to answer that these packages are simply not meant to handle such numbers, but there are papers specifically investigating this high-dimensional regime, for example here, where p = 800 covariates and n = 4000 observations are used.

Granted, that paper used R rather than Python. Unfortunately, I do not know R, but I would have thought Python's optimisation routines should be of comparable 'quality'.

My questions:

Might it be the case that R is better suited than Python's statsmodels to handle data in this high p/n regime? If so, why, and can R's techniques be used to modify the Python statsmodels code?

How could I modify my code to work for numbers around p=800 and n=4000?

Nelewout
Meep
  • This question needs both _code_, and _data_ (or means to generate some). That said, I recall a similar question of yours from a few days ago. What is your main objective? Is this curiosity into the asymptotic behaviour of the MLEs obtained via logistic regression, or do you have a specific research problem with this much data, and do you think a logistic regression might be the way to go about it? – Nelewout Jul 21 '19 at 12:08
  • @N.Wouda It is indeed just a study into the asymptotic behaviour of logistic regression. I have asked similar questions, but none of them are the same. Previously, I was unsure even which packages would be best to use for maximum likelihood estimation. Your post helped very much with this. But now I am realising that the statsmodels package does not seem to be able to get MLEs in the regime I am interested in (large number of covariates, small number of observations). I realise, however, that different packages work very differently. Some have inbuilt regularisation, which would not be good in – Meep Jul 21 '19 at 12:13
  • my study. It doesn't seem that statsmodels has this, but it is also not able to get estimates for high-dimensional data. I have tried altering the code in the modules for logistic regression, but to no avail. However, there are papers that have run simulations in this regime, so it must be explorable with some package... unless they just tried to run it MANY times and got lucky in <1/100 of the attempts (about as many times as I have tried to run it, without a single success) – Meep Jul 21 '19 at 12:16
  • At the moment, I am just trying to figure out how to get the computational tools to work (my weaker part) so I can compare the theory (my stronger part) with the simulations. – Meep Jul 21 '19 at 12:17

1 Answer


In the code you currently use (from several other questions), you implicitly use the Newton-Raphson method. This is the default solver for the sm.Logit model. It computes and then inverts the Hessian matrix to speed up estimation, but that is incredibly costly for large matrices, and it often leads to numerical instability when the matrix is near-singular, as you have already witnessed. This is briefly explained on the relevant Wikipedia entry.
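
To make the cost explicit, here is a rough sketch of the textbook Newton-Raphson update for logistic regression (just an illustration of the iteration, not statsmodels' actual implementation):

import numpy as np

def newton_step(beta, X, y):
    # One textbook Newton-Raphson update for logistic regression.
    # The p x p Hessian X'WX has to be solved at every iteration, which is
    # costly for large p and breaks down when it is (near-)singular.
    mu = 1.0 / (1.0 + np.exp(-X @ beta))    # fitted probabilities
    grad = X.T @ (y - mu)                   # score (gradient of the log-likelihood)
    W = mu * (1.0 - mu)                     # IRLS weights
    hessian = (X * W[:, None]).T @ X        # X'WX
    return beta + np.linalg.solve(hessian, grad)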

You can work around this by using a different solver, e.g. bfgs (or lbfgs), like so:

import statsmodels.api as sm

model = sm.Logit(y, X)
result = model.fit(method='bfgs')   # quasi-Newton: no exact Hessian inversion

This runs perfectly well for me even with n = 10000, p = 2000.

Aside from estimation, and more problematically, your code for generating samples results in data that suffer from a large degree of quasi-separability, in which case the whole MLE approach is questionable at best. You should urgently look into this, as it suggests your data may not be as well-behaved as you might like them to be. Quasi-separability is very well explained here.
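
One quick heuristic (only a rough check, not a formal test for separation) is to look at how many of the fitted probabilities from the result object above are pinned essentially at 0 or 1:

import numpy as np

# Fitted probabilities piling up at 0 or 1 hint at (quasi-)separation.
probs = result.predict(X)
frac_extreme = np.mean((probs < 1e-6) | (probs > 1 - 1e-6))
print(f"fraction of near-0/1 fitted probabilities: {frac_extreme:.3f}")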

Nelewout
  • Amazing! The bfgs really works well, and I am really interested to see that it gives me a much bigger variance for the estimators, as was found in the papers I have read. I'll try to figure out why this is :) – Meep Jul 21 '19 at 13:12