
I'm unit/acceptance testing some code I wrote. It's conceivable that at some point in the real world we will have input data where the dependent variable is constant. Not the norm, but possible. A linear model should yield slope coefficients of 0 in this case, with the intercept absorbing the constant (right?), which is fine and what we would want -- but for some reason I'm getting wild results when I try to fit the model on this use case.

I have tried three models and get different weird results every time -- or no results at all in some cases.

For this use case all of the dependent observations are set to 100, all of the freq_weights are set to 1, and the independent variables are a set of 20 binary-coded dummy features.

In total there are 150 observations.

Again, this data is unlikely in the real world, but I need my code to be able to handle this ugly input. I don't know why I'm getting such erroneous and inconsistent results.

As I understand it, with no variance in the dependent variable I should be getting 0 for all my slope coefficients and an intercept equal to 100.
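
For reference, the test data can be reconstructed along these lines. This is a minimal sketch: the column names, the Freq frame, and the dummy layout are assumptions standing in for the real production schema.

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)

# 150 observations, 20 binary dummy features (reference categories already dropped)
df = pd.DataFrame(rng.integers(0, 2, size=(150, 20)),
                  columns=[f'x{i}' for i in range(1, 21)])

# constant dependent variable and unit frequency weights
df1 = pd.Series(100.0, index=df.index, name='y')
freq = pd.DataFrame({'Freq': np.ones(len(df))})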

freq = freq['Freq']            # frequency weights, all 1 for this test case

Indies = sm.add_constant(df)   # dummy-coded features plus an intercept column
model = sm.OLS(df1, Indies)    # df1 is the constant dependent variable
res = model.fit()
res.params

yields:

const               65.990203
x1                  17.214836

reg = sm.GLM(df1, Indies, freq_weights=freq)
results = reg.fit(method='lbfgs', max_start_irls=0)
results.params

yields:

const               83.205034
x1                  82.575228

reg = sm.GLM(df1, Indies, freq_weights=freq)
result2 = reg.fit()
result2.params

yields:

PerfectSeparationError: Perfect separation detected, results not available
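
In production I can at least trap this exception rather than crash; a minimal sketch, assuming the standard statsmodels exception class:

from statsmodels.tools.sm_exceptions import PerfectSeparationError

try:
    result2 = reg.fit()
except PerfectSeparationError:
    result2 = None   # flag the degenerate input instead of crashing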
  • What's the matrix rank of `Indies`? Does it have full rank or not? A constant dependent (endog) variable should not be a problem. Several diagnostic hypothesis tests use a vector of ones for `endog` in OLS without problems. (But those don't include a constant in exog, so they're not directly comparable.) – Josef May 21 '20 at 23:41
  • you only have one `x1`. That doesn't match up with your description of 20 dummy variables. – Josef May 21 '20 at 23:43
  • I excluded the others for simplicity -- it's just a bunch of x's and coefficients. I'm not sure what you mean by "rank"? It's a dummy-coded matrix of 20 columns. – Data of All Kinds May 22 '20 at 20:02
  • For example, if you did not drop a reference category, then you would have perfect collinearity with the constant and the rank of the design matrix would be less than the number of columns. In that case parameters are not identified and the estimates depend on the details of the optimization algorithm. OLS Results `summary` would print a warning about multicollinearity in that case. – Josef May 22 '20 at 20:47
  • I dropped the reference category for each variable. There is no multicollinearity issue as far as I can tell. No warning and no perfectly correlated features. – Data of All Kinds May 23 '20 at 14:36
  • Then you need to make a reproducible example. I don't have any other guess. Unfortunately, there is currently no way to turn off the PerfectSeparationError exception when estimating GLM with irls. – Josef May 23 '20 at 17:55
  • My apologies -- you were right -- I'm getting the smallest-eigenvalue warning and the condition number is very large... Is there a way to access that information without using the summary method? I want to be able to detect this in production, so is there some way to access the values themselves so that I can set a flag in the code? – Data of All Kinds May 26 '20 at 17:18
  • You have to look at the code for the summary method. Some quantities, like matrix_rank, are precomputed in the linear regression models; some are computed inside summary and not directly available outside (see the sketch after these comments). – Josef May 26 '20 at 17:31
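
Following up on the comments: a minimal sketch of a production check, assuming current statsmodels, where `condition_number` and `eigenvals` on the OLS results are the same quantities `summary()` uses for its multicollinearity warnings:

import numpy as np

res = sm.OLS(df1, Indies).fit()

# rank of the design matrix versus its number of columns
rank_deficient = np.linalg.matrix_rank(Indies) < Indies.shape[1]

# quantities summary() reports in its warning footnotes
condno = res.condition_number      # condition number of the design matrix
smallest_eig = res.eigenvals[-1]   # eigenvalues are sorted in decreasing order

# hypothetical flag; the thresholds mirror the defaults used by summary()
ill_conditioned = rank_deficient or condno > 1000 or smallest_eig < 1e-10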
