
I have used Statsmodels to generate an OLS linear regression model to predict a dependent variable based on about 10 independent variables. The independent variables are all categorical.

I am interested in looking closer at the significance of the coefficients for one of the independent variables. There are 4 categories, so 3 coefficients -- each of which is highly significant. I would also like to look at the significance of the trend across all 3 categories. From my (limited) understanding, this is often done using a Wald test comparing all of the coefficients to 0.

How exactly is this done using Statsmodels? I see there is a `wald_test` method on the OLS results. It seems you have to pass in values for all of the coefficients when using this method.

My approach was the following...

First, here are all of the coefficients:

np.array(lm.params)
array([ 0.21538725,  0.05675108,  0.05020252,  0.08112228,  0.00074715,
        0.03886747,  0.00981819,  0.19907263,  0.13962354,  0.0491201 ,
       -0.00531318,  0.00242845, -0.0097336 , -0.00143791, -0.01939182,
       -0.02676771,  0.01649944,  0.01240742, -0.00245309,  0.00757727,
        0.00655152, -0.02895381, -0.02027537,  0.02621716,  0.00783884,
        0.05065323,  0.04264466, -0.13068456, -0.15694931, -0.25518566,
       -0.0308599 , -0.00558183,  0.02990139,  0.02433505, -0.01582824,
       -0.00027538,  0.03170669,  0.01130944,  0.02631403])

I am only interested in params 2-4 (which are the 3 coefficients of interest).

coeffs = np.zeros_like(lm.params)
coeffs[1:4] = [0.05675108, 0.05020252, 0.08112228]

Checking to make sure this worked:

coeffs
array([ 0.        ,  0.05675108,  0.05020252,  0.08112228,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ,  0.        ,
        0.        ,  0.        ,  0.        ,  0.        ])

Looks good, now to run the test!

lm.wald_test(coeffs)
<class 'statsmodels.stats.contrast.ContrastResults'>
<F test: F=array([[ 13.11493673]]), p=0.000304699208434, df_denom=1248, df_num=1>

Is this the correct approach? I could really use some help!

JHawkins

1 Answer


A linear hypothesis has the form `R params = q`, where `R` is the matrix that defines the linear combination of parameters and `q` is the hypothesized value.

In the simple case where we want to test whether some parameters are zero, the R matrix has a 1 in the column corresponding to the position of the parameter and zeros everywhere else, and q is zero, which is the default. Each row specifies a linear combination of parameters, which defines a hypothesis as part of the overall or joint hypothesis.
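To make the mechanics concrete, the Wald statistic behind this test is `(R b - q)' [R Cov(b) R']^{-1} (R b - q)`. A minimal numpy sketch -- the parameter vector and the diagonal covariance matrix below are made-up numbers for illustration, not the asker's model:

```python
import numpy as np

# Purely illustrative numbers: a parameter vector and an assumed
# diagonal covariance matrix for the estimates.
params = np.array([0.2, 0.057, 0.050, 0.081, 0.001])
cov = np.eye(5) * 0.0004

R = np.eye(5)[1:4]        # each row picks out one parameter
q = np.zeros(3)           # hypothesized values (the default: zero)

diff = R @ params - q
# Wald statistic: diff' [R Cov R']^{-1} diff
wald_stat = diff @ np.linalg.solve(R @ cov @ R.T, diff)
print(wald_stat)
```

Under the null this statistic is chi-squared with 3 degrees of freedom; the F form that statsmodels reports for OLS divides by the number of restrictions.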

In this case, the simplest way to get the restriction matrix is by using the corresponding rows of an identity matrix:

R = np.eye(len(lm.params))[1:4]

Then, lm.wald_test(R) will provide the test for the joint hypothesis that the 3 parameters are zero.

A simpler way to specify the restriction is by using the names of the parameters and defining the restrictions by a list of strings.
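For example, a sketch with an assumed formula model whose parameters happen to be named `x1`, `x2`, `x3` (the names and data are illustrative; the comma-separated string is one accepted way to combine restrictions):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative data; the parameter names come from the formula columns.
rng = np.random.RandomState(0)
df = pd.DataFrame(rng.randn(100, 3), columns=['x1', 'x2', 'x3'])
df['y'] = df['x1'] + rng.randn(100)
lm = smf.ols('y ~ x1 + x2 + x3', data=df).fit()

# Each restriction is written using the parameter names.
res = lm.wald_test('x1 = 0, x2 = 0, x3 = 0')
print(res)
```

A list of strings, `lm.wald_test(['x1 = 0', 'x2 = 0', 'x3 = 0'])`, expresses the same joint hypothesis.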

The model result classes also have a new method wald_test_terms which automatically generates the wald tests for terms in the design matrix where the hypothesis includes several parameters or columns, as in the case of categorical explanatory variables or of polynomial explanatory variables. This is available in statsmodels master and will be in the upcoming 0.7 release.
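A sketch of `wald_test_terms` with a single categorical regressor (the variable names and data are illustrative; this requires statsmodels >= 0.7):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative data: one 4-level categorical predictor, as in the question.
rng = np.random.RandomState(0)
df = pd.DataFrame({'g': rng.choice(['a', 'b', 'c', 'd'], size=200),
                   'y': rng.randn(200)})
lm = smf.ols('y ~ C(g)', data=df).fit()

# One joint test per term: the 'C(g)' row covers all 3 dummy columns at once.
wt = lm.wald_test_terms()
print(wt)
```

This saves building the restriction matrix by hand: each term in the formula gets its own joint test over all of its columns.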

Josef
  • Thank you so much! This was extremely helpful! For my own curiosity, would you mind giving me an example of how to specify the restriction using the names of the parameters? Keeping it simple, let's say there are the following parameters: `'Ind1_A', 'Ind1_B', 'Ind1_C', 'Ind2_A', 'Ind2_B', 'Ind2_C', 'Ind3_A', 'Ind3_B', 'Ind3_C'`. And suppose I only want to test the trend for the 3 `'Ind1_'` params. – JHawkins Mar 08 '15 at 19:57
  • I need to look for the examples for how to specify the strings. The f_test is the same as the wald_test but always uses the F-distribution. The interface is the same. Some examples are at the end here http://statsmodels.sourceforge.net/devel/generated/statsmodels.regression.linear_model.RegressionResults.f_test.html. – Josef Mar 08 '15 at 21:01
  • Try a list of strings, `wald_test(['Ind1_A', 'Ind1_B', 'Ind1_C'])`, or a comma-separated string, `wald_test('Ind1_A, Ind1_B, Ind1_C')`. I don't have an example at hand to check what the supported syntax for the separation between hypotheses is. If there is no equal sign, then it defaults to equal zero. More complicated linear expressions are also supported, like `'Ind1_A + 2 * Ind1_B - Ind1_C = 5'`. – Josef Mar 08 '15 at 21:20
  • http://patsy.readthedocs.org/en/latest/API-reference.html#patsy.DesignInfo.linear_constraint – Josef Mar 08 '15 at 21:42
  • Thanks @user333700! The list of strings worked. I had tried that before (well before I tried the approach listed in my question), but I left off the added label (e.g., 'Ind1' instead of 'Ind1[T.25-50%]'). – JHawkins Mar 09 '15 at 14:48