Run a chi square test with observation and expectation counts and get confidence interval

Question

I am new to chi squared testing and trying to figure out what the 'standard' way of running a chi-squared test and also getting a 95% confidence interval on the difference between success rates in two experiments.

My data looks like this:

Condition A:    25           75           100
Condition B:    100          100          200
Total:          125          175

These numbers represent the counts of what was observed during an experiment. As you can see, the number of samples for Condition A vs. Condition B were not the same.

What I'd like to get is:

A test statistic indicating whether Condition A's success rate of 33% is statistically different from Condition B's success rate of 50%.
I'd also like to get a 95% confidence interval on the difference between the two success rates.

It seems like scipy.stats.chisquare expects the user to adjust the 'expected' counts so that they appear to be taken out of the same sample size as the 'observed' counts. Is that the only transformation I need to do? If not, what else do I need to do? Finally, how would I go about calculating the 95% confidence interval for the difference in proportions?

I don't think chi-squared is the appropriate test for what you want. — juanpa.arrivillaga, Aug 31 '16 at 01:17
What are the values in each column? Is it in a csv file or something else? Please [edit] your question and add this crucial information. — martineau, Aug 31 '16 at 01:50
@juanpa.arrivillaga I believe this is a classic example of chi squared application for categorical data. — helloB, Aug 31 '16 at 02:41
Sorry, that's not any better. The first column is self-explanatory, but it's important to know which of the next two columns are the counts of the success and failures (assuming the last column is their total). Also you haven't mentioned what format the data is in. Is it in a file or data-structure in memory or something else. We're not mind-readers... — martineau, Aug 31 '16 at 03:00
@helloB I meant with regards to the 95% confidence interval. I'll elaborate in an answer. — juanpa.arrivillaga, Aug 31 '16 at 05:36
@helloB Or rather, that wouldn't really be on topic for SO, so just check out [this link](http://sphweb.bumc.bu.edu/otlt/MPH-Modules/BS/BS704_Confidence_Intervals/BS704_Confidence_Intervals7.html) and it should be easy to implement. — juanpa.arrivillaga, Aug 31 '16 at 05:55

Warren Weckesser · Accepted Answer · 2018-11-07T22:54:24.697

You have a contingency table. To perform the χ² test on this data, you can use scipy.stats.chi2_contingency:

In [31]: from scipy.stats import chi2_contingency

In [32]: obs = np.array([[25, 75], [100, 100]])

In [33]: obs
Out[33]: 
array([[ 25,  75],
       [100, 100]])

In [34]: chi2, p, dof, expected = chi2_contingency(obs)

In [35]: p
Out[35]: 5.9148695289823149e-05

Your contingency table is 2x2, so you can use Fisher's exact test. This is implemented in scipy as scipy.stats.fisher_exact:

In [148]: from scipy.stats import fisher_exact

In [149]: oddsr, pval = fisher_exact(obs)

In [150]: pval
Out[150]: 3.7175015403965242e-05

scipy doesn't have much more for contingency tables. It looks like the next release of statsmodels will have more tools for the analysis of contingency tables, but that doesn't help right now.

It isn't hard to write some code to compute the difference in proportion and its 95% confidence interval. Here's one way:

# Include this if you are using Python 2.7.  Or tweak the code in the
# function to ensure that division uses floating point.
from __future__ import division


def diffprop(obs):
    """
    `obs` must be a 2x2 numpy array.

    Returns:
    delta
        The difference in proportions
    ci
        The Wald 95% confidence interval for delta
    corrected_ci
        Yates continuity correction for the 95% confidence interval of delta.
    """
    n1, n2 = obs.sum(axis=1)
    prop1 = obs[0,0] / n1
    prop2 = obs[1,0] / n2
    delta = prop1 - prop2

    # Wald 95% confidence interval for delta
    se = np.sqrt(prop1*(1 - prop1)/n1 + prop2*(1 - prop2)/n2)
    ci = (delta - 1.96*se, delta + 1.96*se)

    # Yates continuity correction for confidence interval of delta
    correction = 0.5*(1/n1 + 1/n2)
    corrected_ci = (ci[0] - correction, ci[1] + correction)

    return delta, ci, corrected_ci

For example,

In [22]: obs
Out[22]: 
array([[ 25,  75],
       [100, 100]])

In [23]: diffprop(obs)
Out[23]: 
(-0.25,
 (-0.35956733089748971, -0.14043266910251032),
 (-0.36706733089748972, -0.13293266910251031))

The first value returned is the difference in proportions delta. The next two pairs are the Wald 95% confidence interval for delta, and the Wald 95% confidence interval with Yates continuity correction.

If you don't like those negative values, you can reverse the rows first:

In [24]: diffprop(obs[::-1])
Out[24]: 
(0.25,
 (0.14043266910251032, 0.35956733089748971),
 (0.13293266910251031, 0.36706733089748972))

For comparison, here is a similar calculation in R:

> obs
     [,1] [,2]
[1,]   25   75
[2,]  100  100
> prop.test(obs, correct=FALSE)

    2-sample test for equality of proportions without continuity
    correction

data:  obs
X-squared = 17.1429, df = 1, p-value = 3.467e-05
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.3595653 -0.1404347
sample estimates:
prop 1 prop 2 
  0.25   0.50 

> prop.test(obs, correct=TRUE)

    2-sample test for equality of proportions with continuity correction

data:  obs
X-squared = 16.1297, df = 1, p-value = 5.915e-05
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.3670653 -0.1329347
sample estimates:
prop 1 prop 2 
  0.25   0.50

Wow thanks so much for this amazingly comprehensive answer! – helloB Aug 31 '16 at 15:07 — helloB, Aug 31 '16 at 15:07

score 1 · Answer 2 · edited Jun 06 '18 at 02:12

1

I would just add to Warren's great answer that n1 = float(n1) and n2 = float(n2). They should be cast as floats (for Python 2 users) otherwise the divisions will simply yield 0's.

edited Jun 06 '18 at 02:12

Stephen Rauch

47,830
31
106
135

answered Jun 06 '18 at 01:50

NoriEKSA

11
1

2

This should be a comment on that answer - not a separate answer. This makes no sense without that other answer, so it is not self-contained and should be deleted. [From Review](https://stackoverflow.com/review/low-quality-posts/32510756). – Wai Ha Lee Aug 18 '22 at 09:57

Run a chi square test with observation and expectation counts and get confidence interval

2 Answers2