I am trying to run a regression where only some of the coefficients can be identified:
import numpy as np
import pandas as pd
import statsmodels.api as sm

data = np.array([[2, 1, 1, 1], [1, 1, 1, 0]])
df = pd.DataFrame(data, columns=['y', 'x1', 'x2', 'x3'])
z = df.pop('y')
mod = sm.OLS(z, sm.add_constant(df))
Now, I have two observations, and the only variable that changes between them is x3. So I would expect that (since I added a constant) the model would be unable to identify x1 or x2 and would omit them. It should, however, give me a coefficient of 1 for x3, since the presence of that effect increases y by one.
Stata gives me exactly this outcome, and it notes that it cannot estimate a standard error on the coefficient for x3. statsmodels, on the other hand...
res = mod.fit()
res.summary()
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       1.000
Model:                            OLS   Adj. R-squared:                    nan
Method:                 Least Squares   F-statistic:                       nan
Date:                Sun, 30 Aug 2020   Prob (F-statistic):                nan
Time:                        14:28:28   Log-Likelihood:                 66.947
No. Observations:                   2   AIC:                            -129.9
Df Residuals:                       0   BIC:                            -132.5
Df Model:                           1
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
x1             0.5000        inf          0        nan         nan         nan
x2             0.5000        inf          0        nan         nan         nan
x3             1.0000        inf          0        nan         nan         nan
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   0.200
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.333
Skew:                           0.000   Prob(JB):                        0.846
Kurtosis:                       1.000   Cond. No.                         3.23
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The input rank is higher than the number of observations.
"""
What is happening here? And how can I get my expected output?