I am trying to investigate the distribution of maximum likelihood estimates, specifically for a large number of covariates p in a high-dimensional regime (meaning that p/n, with n the sample size, is about 1/5). I am generating the data and then using statsmodels.api.Logit to fit the parameters of my model.
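For reference, here is a minimal sketch of the kind of setup I mean; the design matrix, coefficient values and seed are just placeholders, not my exact data-generating process:

```python
import numpy as np
import statsmodels.api as sm

# Illustrative dimensions in the p/n ~ 1/5 regime (placeholders)
n, p = 4000, 800
rng = np.random.default_rng(0)

# Gaussian design and arbitrary true coefficients, purely for illustration
X = rng.standard_normal((n, p))
beta = rng.normal(scale=1.0 / np.sqrt(p), size=p)

# Draw binary responses from the logistic model
probs = 1.0 / (1.0 + np.exp(-X @ beta))
y = rng.binomial(1, probs)

# Fit by maximum likelihood with statsmodels' Logit
model = sm.Logit(y, X)
result = model.fit()   # this is where the divergence / singular matrix errors appear
print(result.params[:5])
```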
The problem is that this only seems to work in a low-dimensional regime (like 300 covariates and 40,000 observations). Otherwise I get a warning that the maximum number of iterations has been reached, the log-likelihood is inf (i.e. it has diverged), and a 'singular matrix' error.
I am not sure how to remedy this. Earlier, when I was still working with smaller values (say 80 covariates, 4000 observations) and got this error occasionally, setting a maximum of 70 iterations rather than 35 seemed to help. That clearly will not help now, though, because my log-likelihood function is diverging; it is not just a matter of non-convergence within the maximum number of iterations.
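For completeness, the only change I had made at that point was along these lines (continuing from the sketch above; 70 is just the value I picked):

```python
# Raising the iteration cap was enough in the smaller (p=80, n=4000) setting
result = model.fit(maxiter=70)   # default is maxiter=35
```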
It would be easy to answer that these packages are simply not meant to handle such numbers, but there have been papers specifically investigating this high-dimensional regime, for example here, where p=800 covariates and n=4000 observations are used.
Granted, that paper used R rather than Python. Unfortunately I do not know R, but I would think that Python's optimisation should be of comparable 'quality'?
My questions:
Might it be the case that R is better suited than Python's statsmodels to handle data in this high p/n regime? If so, why, and can R's techniques be used to modify the Python statsmodels code?
How could I modify my code to work for numbers around p=800 and n=4000?