
I'd like to apply the chi-square test scipy.stats.chisquare, but the total number of observations differs between my two groups.

import pandas as pd

data={'expected':[20,13,18,21,21,29,45,37,35,32,53,38,25,21,50,62],
      'observed':[19,10,15,14,15,25,25,20,26,38,50,36,30,28,59,49]}

data=pd.DataFrame(data)
print(data.expected.sum())
print(data.observed.sum())

Ignoring this would be incorrect, right?

Does the default behavior of scipy.stats.chisquare take this into account? I checked with pen and paper and it looks like it doesn't. Is there a parameter for this?

from scipy.stats import chisquare
# incorrect since the number of observations is unequal 
chisquare(f_obs=data.observed, f_exp=data.expected)

When I adjust the counts manually, I get a slightly different result.

# rescale the observed counts so that their total matches the expected total
data['obs_prop'] = data['observed'] / data['observed'].sum()
data['observed_new'] = data['obs_prop'] * data['expected'].sum()

# proper way
chisquare(f_obs=data.observed_new, f_exp=data.expected)
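For comparison, there are two ways to make the totals agree, and they do not give the same statistic. The sketch below is only illustrative: it assumes the expected column encodes hypothesized proportions rather than fixed counts, and the names k, exp_scaled, stat_a and stat_b are made up for this example. As far as I can tell, newer SciPy releases raise a ValueError when the two sums differ by more than a small relative tolerance, so both calls here are given matching totals.

```python
import numpy as np
from scipy.stats import chisquare

expected = np.array([20, 13, 18, 21, 21, 29, 45, 37, 35, 32, 53, 38, 25, 21, 50, 62], dtype=float)
observed = np.array([19, 10, 15, 14, 15, 25, 25, 20, 26, 38, 50, 36, 30, 28, 59, 49], dtype=float)

# ratio of the two totals (520 / 459)
k = expected.sum() / observed.sum()

# Option A (the adjustment used in the question): scale observed up to the expected total
stat_a, p_a = chisquare(f_obs=observed * k, f_exp=expected)

# Option B (assumption: only the expected proportions matter):
# scale expected down to the observed total, keeping the real counts
exp_scaled = expected / k
stat_b, p_b = chisquare(f_obs=observed, f_exp=exp_scaled)

# pen-and-paper version of option B: sum((O - E)^2 / E)
manual_b = ((observed - exp_scaled) ** 2 / exp_scaled).sum()

print(stat_a, p_a)
print(stat_b, p_b)
print(manual_b)
```

The statistic from option A is exactly k times the statistic from option B, because the chi-square statistic grows with the sample size. Option B keeps the actual number of observations, which is the usual choice for a goodness-of-fit test.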

Please correct me if I am wrong at some point. Thanks.

PS: I tagged R for additional statistical expertise.

Anton
  • I don't understand your question. Both groups have 16 observations. What do you mean "And the total number of observations is different in my groups."? – wong.lok.yin Jan 21 '20 at 01:32
  • The observations are the sum of each vector, not the number of categories. If the sums are different, it is probably due to rounding error in computing the expected values. – dcarlson Jan 21 '20 at 02:07
  • I think the [Cross Validated](https://stats.stackexchange.com/) stackexchange site is a better forum for this question. – Warren Weckesser Jan 21 '20 at 12:39
  • Ok. will forward this question there. – Anton Jan 22 '20 at 18:48

2 Answers


It turned out this was a different statistical problem: a chi-square test of independence of variables in a contingency table.

from scipy.stats import contingency as cont

# treat the two columns of data as a 16x2 contingency table
chi2, p, dof, exp = cont.chi2_contingency(data)
print(p)
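To make explicit what that call does with the data from the question, here is a self-contained sketch. It assumes the two columns are counts from two independent groups over the same 16 categories; the variable name table is introduced only for this example.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

data = pd.DataFrame({
    'expected': [20, 13, 18, 21, 21, 29, 45, 37, 35, 32, 53, 38, 25, 21, 50, 62],
    'observed': [19, 10, 15, 14, 15, 25, 25, 20, 26, 38, 50, 36, 30, 28, 59, 49],
})

# 16 rows (categories) x 2 columns (groups)
table = data.to_numpy()

chi2, p, dof, exp = chi2_contingency(table)

# degrees of freedom: (rows - 1) * (columns - 1)
print(dof)
# the fitted expected counts keep the original column totals,
# so differing group sizes are handled by the test itself
print(exp.sum(axis=0))
print(chi2, p)
```

Here the unequal totals need no manual adjustment, because the expected counts are derived from the row and column margins of the table.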
Anton

I didn't quite understand the question. As I see it, you can use scipy.stats.chi2_contingency if you want to test independence between two categorical variables, and scipy.stats.chisquare to compare observed against expected frequencies. For the latter, the number of categories must be the same in both. Logically, a category gets a frequency of 0 when it has an observed frequency but no expected frequency, and vice versa.
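A minimal sketch of that zero-filling with pandas, using made-up category labels and counts (none of these values come from the question):

```python
import pandas as pd

# hypothetical counts: category "c" was never observed, "d" has no expected count
observed = pd.Series({"a": 12, "b": 7, "d": 3})
expected = pd.Series({"a": 10, "b": 8, "c": 4})

# union of all category labels, then fill the missing ones with 0
categories = observed.index.union(expected.index)
observed_aligned = observed.reindex(categories, fill_value=0)
expected_aligned = expected.reindex(categories, fill_value=0)

print(pd.DataFrame({"observed": observed_aligned, "expected": expected_aligned}))
```

Note that a category with an expected frequency of 0 makes the term (O - E)^2 / E undefined, so such categories usually have to be merged or dropped before calling scipy.stats.chisquare.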

Hope this helps

dito