
I'm trying to do an F-test of equality of coefficient for the three experimental groups I have in my data.

I've run a regression to evaluate the results of a random control trial that included four groups, G1, G2, G3 and control.

Now I need to test whether the coefficients for the experimental groups (G1, G2, G3) are equal to each other.

I know I can do this using statsmodels' OLSResults.f_test, but I am unclear on how to configure it. The documentation gives examples, but I'm not sure how to translate them to my case: https://www.statsmodels.org/stable/generated/statsmodels.regression.linear_model.OLSResults.f_test.html

The example given there is:

from statsmodels.datasets import longley
from statsmodels.formula.api import ols
dta = longley.load_pandas().data
formula = 'TOTEMP ~ GNPDEFL + GNP + UNEMP + ARMED + POP + YEAR'
results = ols(formula, dta).fit()
hypotheses = '(GNPDEFL = GNP), (UNEMP = 2), (YEAR/1829 = 1)'
f_test = results.f_test(hypotheses)
print(f_test)

How do I essentially write the below hypotheses so that I can check whether my 3 experimental groups are different?

hypotheses = '(G1=G2), (G1=G3), (G2=G3)'
    Look at the `results.summary()` to see what names were created for the levels. Then, use those to specify the equality hypothesis. For more specific help you need to show your formula and summary. – Josef May 22 '20 at 03:58
  • Damn it Josef....I've come back to this issue, and you were right all along. I owe you so much. Thanks – Jameson Feb 02 '21 at 15:43
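Josef's suggestion above can be sketched as follows. This is a hedged example with synthetic data: the column name `grp` and the level names are illustrative, but the `grp[T.level]` coefficient naming is what statsmodels' formula interface (patsy treatment coding) produces, with the alphabetically first level as the reference.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic RCT-like data: one control group and three treatment groups.
rng = np.random.default_rng(2)
df = pd.DataFrame({
    "grp": np.repeat(["control", "g1", "g2", "g3"], 25),
    "y": rng.normal(size=100),
})

res = ols("y ~ grp", df).fit()

# The parameter names show how to refer to each level in a hypothesis string;
# "control" is the reference level here, so it has no coefficient of its own.
print(list(res.params.index))
# ['Intercept', 'grp[T.g1]', 'grp[T.g2]', 'grp[T.g3]']
```

These are the exact names to use inside the `hypotheses` string passed to `f_test`.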

1 Answer

We can use the iris example:

from statsmodels.formula.api import ols
import pandas as pd

df = pd.read_csv("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data",
                 header=None, names=["s_wid", "s_len", "p_wid", "p_len", "species"])

df.species.unique()

array(['Iris-setosa', 'Iris-versicolor', 'Iris-virginica'], dtype=object)

There are 3 categories in species, and we can fit a model like you did:

formula = 's_len ~ species'
results = ols(formula, df).fit()

If we look at the results:

results.summary()

Dep. Variable:  s_len   R-squared:  0.392
Model:  OLS Adj. R-squared: 0.384
Method: Least Squares   F-statistic:    47.36
Date:   Sat, 23 May 2020    Prob (F-statistic): 1.33e-16
Time:   01:07:39    Log-Likelihood: -49.688
No. Observations:   150 AIC:    105.4
Df Residuals:   147 BIC:    114.4
Df Model:   2       
Covariance Type:    nonrobust

    coef    std err t   P>|t|   [0.025  0.975]
Intercept   3.4180  0.048   70.998  0.000   3.323   3.513
species[T.Iris-versicolor]  -0.6480 0.068   -9.518  0.000   -0.783  -0.513
species[T.Iris-virginica]   -0.4440 0.068   -6.521  0.000   -0.579  -0.309

If your model consists of only the groups, like the above, then the F-statistic (47.36) and p-value (1.33e-16) are what you need. This F-test compares the model against an intercept-only model.
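That equivalence can be verified directly. A minimal sketch with synthetic data (the names `grp` and `y` are illustrative): for a groups-only model, the overall F-statistic in the summary equals the F-test that all non-intercept group coefficients are jointly zero.

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data: four groups, "control" is the reference level (alphabetically first).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "grp": np.repeat(["control", "g1", "g2", "g3"], 50),
    "y": rng.normal(size=200),
})
res = ols("y ~ grp", df).fit()

# Jointly test that every group coefficient (relative to the reference level)
# is zero -- the same null hypothesis the model's overall F-statistic tests.
ftest = res.f_test("grp[T.g1] = 0, grp[T.g2] = 0, grp[T.g3] = 0")
print(res.fvalue, ftest.fvalue)  # the two F values agree
```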

A more detailed explanation: the model is fitted with Iris-setosa as the reference level, and the other two species' effects on sepal length, s_len, are estimated as coefficients relative to Iris-setosa. Looking at the group means makes this clear:

df.groupby('species')['s_len'].mean()

Iris-setosa        3.418
Iris-versicolor    2.770
Iris-virginica     2.974

In this case, the null hypothesis is Iris-versicolor = 0 and Iris-virginica = 0, i.e. that the groups are all equal:

hypotheses = '(species[T.Iris-versicolor] = 0), (species[T.Iris-virginica] = 0)'
results.f_test(hypotheses)

<class 'statsmodels.stats.contrast.ContrastResults'>
<F test: F=array([[47.3644614]]), p=1.327916518456957e-16, df_denom=147, df_num=2>

Now, you can see this is exactly the same as the F-statistic provided in the summary.
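For the original question, where the model also contains a control group, the equality of the three treatment coefficients (rather than their equality to zero) can be tested with pairwise restrictions in the same style as the longley example. A hedged sketch with illustrative names; note that only two of the three pairwise equalities are independent (G1 = G2 and G2 = G3 together imply G1 = G3), so two restrictions suffice:

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data standing in for the RCT: "control" is the reference level.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "grp": np.repeat(["control", "g1", "g2", "g3"], 50),
    "y": rng.normal(size=200),
})
res = ols("y ~ grp", df).fit()

# Test that the three treatment coefficients are equal to each other,
# without involving the control reference level.
hypotheses = "(grp[T.g1] = grp[T.g2]), (grp[T.g2] = grp[T.g3])"
print(res.f_test(hypotheses))  # an F-test with df_num = 2
```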

  • I feel so stupid for not having worked that out myself while at the same time so enlightened after your explanation. Thank you so much. – Jameson May 23 '20 at 11:13
  • @Jameson, you're welcome.. no worries.. no one is stupid.. we just take turns figuring things out :) – StupidWolf May 23 '20 at 11:59