
I have run into a problem fitting a binomial logistic regression: the results seem inconsistent between languages. Having spent an extended period looking into this and searching for suggestions online (and trying all data variations just in case), I believe it comes down to the fitting procedure MATLAB uses for glmfit (I have a sneaking suspicion it's a maximum likelihood estimator, whereas Python and R use IRLS/IWLS).

I first ran my problem in MATLAB using:

[b_lr,dev,stats] = glmfit(x',y','binomial','link','logit');

Where x' is a multi-column array of predictors with the same number of rows as y, and y is a binary response vector.

Since that calculation I've moved to using Python/rpy2. I tried the same procedure in both Python and R, fitting a logit-linked binomial using the equivalent of glmfit from statsmodels, and got a different set of coefficients for the regression (note that the position of the response vector changes between these two):

glm_logit = sm.GLM(yvec.T,Xmat,family = sm.families.Binomial()).fit()

and using rpy2:

%R glm.out = glm(Data ~ ONI + Percentiles, family=binomial(logit), data=df) 

I would appreciate it if someone could clarify what MATLAB uses, and any suggestions for how to replicate the MATLAB result in Python or R.

Adriaan
aphex

2 Answers


Since this is a very general question without any details, here is a partial answer that is also general, based on my comparison of R, Stata, and statsmodels; I don't have MATLAB.

GLM is a maximum likelihood (or quasi-maximum likelihood) model. The parameter estimates should be independent of the optimizer, whether it's IRLS or something else. Differences can come from numerical precision problems, different convergence criteria or different handling of ill-defined problems.

First, you need to check that they are actually estimating the same model by comparing the design matrix across packages. The two main sources are whether a constant is included by default or not, and how categorical variables are encoded.
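To make the second point concrete (a sketch with made-up data, not from the original answer): the same categorical variable can enter the design matrix with all of its levels, or with one reference level dropped, and the two choices produce different coefficient sets.

```python
import pandas as pd

# Made-up data illustrating the encoding check.
df = pd.DataFrame({"season": ["winter", "spring", "summer", "winter"]})

# Keep all levels -- collinear with an intercept column:
print(pd.get_dummies(df["season"]))

# Drop one reference level, as most GLM defaults do:
print(pd.get_dummies(df["season"], drop_first=True))
```

Printing (or exporting) the design matrix from each package and comparing them column by column is usually the fastest way to spot this kind of mismatch.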

Second, check that the data allows for a well-defined model. The main differences across packages are in the treatment of singular or almost-singular design matrices, and in how they handle perfect separation in the case of Logit and similar models.
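A quick diagnostic sketch for the singularity check (my addition, not from the answer): inspect the design matrix's rank and condition number before fitting; exactly or nearly collinear columns are exactly the cases packages handle differently.

```python
import numpy as np

# Toy design matrix whose third column duplicates the second.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
X = np.column_stack([np.ones(5), x, 2.0 * x])

print(np.linalg.matrix_rank(X))  # 2, not 3: the matrix is rank-deficient
print(np.linalg.cond(X))         # enormous condition number
```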

Third, maybe it's a coding mistake. Since you don't provide a reproducible example, that's impossible to tell.

Josef

I came across a similar problem when fitting a Poisson GLM: the coefficients returned by MATLAB and R were quite different. After some investigation I found a subtle difference in how the design matrix is constructed. In GLM regression, one level of a categorical variable is usually taken as the "reference" (to keep the columns of the design matrix linearly independent). MATLAB simply drops the reference level by removing its column from the design matrix. R, in my case, not only removed that column but also set -1 in all the remaining columns for the rows belonging to the reference level — this corresponds to sum-to-zero contrasts (`contr.sum`) rather than R's default treatment contrasts. The result is a different set of coefficients, but the predicted responses are exactly the same. Does that make sense?

The only way I could make Matlab and R return the same answer was by composing the design matrix manually for Matlab. I hope this helps.