Why aren't the expected frequencies returned by scipy.stats.contingency.expected_freq what I expect?

Question

I have a data frame for which I want to calculate a chi-squared statistic and p-value. However, when I print the expected frequencies, they are not what I expect. The null hypothesis I expected the code to test is that Q7 does not depend on 'ConcernImprovement', so I expected the expected frequencies for Decrease, Increase and No change to be the same within each Q7 row.

This is my observed data frame which is called LikelihoodConcern:

ConcernImprovement  Decrease  Increase  No change
Q7                                               
Likely                   2.0      18.0       21.0
Not likely at all        0.0       2.0        1.0
Not very likely          3.0      11.0        5.0
Somewhat likely          4.0      24.0       14.0
Very likely              1.0      16.0        8.0

I tried this code:

from scipy.stats import chi2_contingency
chi2, p, dof, expected = chi2_contingency(LikelihoodConcern, correction=False)
expected

It returns this for the expected frequencies:

array([[ 3.15384615, 22.39230769, 15.45384615],
       [ 0.23076923,  1.63846154,  1.13076923],
       [ 1.46153846, 10.37692308,  7.16153846],
       [ 3.23076923, 22.93846154, 15.83076923],
       [ 1.92307692, 13.65384615,  9.42307692]])

I expected it to return:

array([[13.66666667, 13.66666667, 13.66666667],
       [ 1.00000000,  1.00000000,  1.00000000],
       [ 6.33333333,  6.33333333,  6.33333333],
       [14.00000000, 14.00000000, 14.00000000],
       [ 8.33333333,  8.33333333,  8.33333333]])

I have looked at the source code for the expected_freq function, as the documentation doesn't have much detail, but I still don't understand why I am not seeing what I expect.

The relative proportions of the values in each row are determined by the relative proportions of the sums of the columns. In your example, the sums of the columns are [10, 71, 49]. In the expected array, each row is proportional to that marginal sum. — Warren Weckesser, Aug 26 '19 at 14:05
Hi Warren, I think I understand what you're saying. So I think in the case for what I want to do, the scipy expected frequencies is not appropriate. But I will check in textbooks and online first. — Joanne Cook, Aug 26 '19 at 14:26
Ah I understand now how the formula works. What I expected wasn't correct because both the row and column sums have to equal what they did before, but in the expected version I thought I should get they didn't. — Joanne Cook, Aug 26 '19 at 14:52
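The point in the last comment can be checked numerically. The following is a minimal sketch, assuming the observed counts from the question are entered as a plain NumPy array; the variable names (`observed`, `uniform`, and the `*_match` flags) are illustrative and not from the original post. It compares the marginal sums of the table scipy returns with those of the "equal split per row" table the question anticipated.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Observed counts from the question (rows: Q7 levels, columns: Decrease, Increase, No change)
observed = np.array([
    [2, 18, 21],
    [0, 2, 1],
    [3, 11, 5],
    [4, 24, 14],
    [1, 16, 8],
], dtype=float)

# Expected frequencies under independence, as computed by scipy
expected = expected_freq(observed)

# The table the question anticipated: each row total split evenly over the columns
n_cols = observed.shape[1]
uniform = np.repeat(observed.sum(axis=1, keepdims=True) / n_cols, n_cols, axis=1)

# scipy's table keeps both sets of marginal sums
rows_match = np.allclose(expected.sum(axis=1), observed.sum(axis=1))
cols_match = np.allclose(expected.sum(axis=0), observed.sum(axis=0))

# The evenly split table keeps the row sums but not the column sums
uniform_cols_match = np.allclose(uniform.sum(axis=0), observed.sum(axis=0))

print(rows_match, cols_match, uniform_cols_match)
```

With these counts the evenly split table gives every column the same total, whereas the observed column totals are 10, 71 and 49, so it cannot be the expected table for a test of independence.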

score 0 · Accepted Answer · answered Aug 26 '19 at 13:38

I tested it with the same input data as you had:

array([[ 2., 18., 21.],
       [ 0.,  2.,  1.],
       [ 3., 11.,  5.],
       [ 4., 24., 14.],
       [ 1., 16.,  8.]])

and got back the same expected frequencies that you did. Take the first cell (row 'Likely', column 'Decrease'). The marginal sum for 'Likely' is 41, and for 'Decrease' it is 10. The total for the whole table is 130. The expected value for a cell is (column sum * row sum) / table total, so for the first cell we have:

(10 * 41) / 130 = 3.1538461538461537

For the bottom-right cell (row 'Very likely', column 'No change') we have:

(49 * 25) / 130 = 9.423076923076923

and so on for the remaining cells. These match the results from scipy.stats.
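The hand calculation can be reproduced for every cell at once. This is a rough sketch, assuming the same observed counts typed in as a NumPy array; the names `row_sums`, `col_sums`, `total` and `manual` are my own and not part of scipy. It builds the expected table as (row sum * column sum) / total and compares it with what scipy.stats.contingency.expected_freq returns.

```python
import numpy as np
from scipy.stats.contingency import expected_freq

# Observed counts from the question
observed = np.array([
    [2, 18, 21],
    [0, 2, 1],
    [3, 11, 5],
    [4, 24, 14],
    [1, 16, 8],
], dtype=float)

row_sums = observed.sum(axis=1)  # 41, 3, 19, 42, 25
col_sums = observed.sum(axis=0)  # 10, 71, 49
total = observed.sum()           # 130

# Expected count for cell (i, j) is row_sums[i] * col_sums[j] / total
manual = np.outer(row_sums, col_sums) / total

print(manual[0, 0])    # (10 * 41) / 130, the 'Likely' / 'Decrease' cell
print(manual[-1, -1])  # (49 * 25) / 130, the 'Very likely' / 'No change' cell

# Compare the hand-built table with scipy's
print(np.allclose(manual, expected_freq(observed)))
```

The outer product is just a compact way of applying the per-cell formula to the whole table; a double loop over rows and columns would give the same numbers.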

Ah okay, thank you! So if scipy is calculating it correctly, then I guess my problem is completely separate, in that I don't understand expected frequencies. Thank you for your answer! I'll go find some stats resources now to help me out :) — Joanne Cook, Aug 26 '19 at 13:57

