Chi-squared for determining people voting in each category

Question

My dataset contains the following columns:

Voted?  Political Category
1       Right
0       Left
1       Center
1       Right
1       Right
1       Right

I want to see which political category is most strongly associated with having voted. To do this, I need to run a chi-squared test. I would like to group by Voted? and Political Category to get counts like these:

(1, Right): 1500 people
(0, Right): 212 people
(1, Left): 826 people
(0, Left): 652 people
(1, Center): 431 people
(0, Center): 542 people

In R, I would do:

yes = c(1500, 826, 431)
no  = c(212, 652, 542)
TBL = rbind(yes, no);  TBL

    [,1] [,2] [,3]
yes 1500  826  431
no   212  652  542

and apply

chisq.test(TBL, correct = FALSE)

with:

X-squared = 630.08, df = 2, p-value < 2.2e-16

Even better would be prop.test, since it also gives the proportion of people who voted in each political category:

   prop 1    prop 2    prop 3 
0.8761682 0.5588633 0.4429599

I would like to get the same, or similar, results in Python.

Warren Weckesser · Accepted Answer · 2022-01-22T01:09:01.867

Your data is in the form of a contingency table. SciPy has the function scipy.stats.chi2_contingency for applying the chi-squared test to a contingency table.

For example,

In [48]: import numpy as np

In [49]: from scipy.stats import chi2_contingency

In [50]: tbl = np.array([[1500, 826, 431], [212, 652, 542]])

In [51]: stat, p, df, expected = chi2_contingency(tbl)

In [52]: stat
Out[52]: 630.0807418107023

In [53]: p
Out[53]: 1.5125346728116583e-137

In [54]: df
Out[54]: 2

In [55]: expected
Out[55]: 
array([[1133.79389863,  978.82440548,  644.38169589],
       [ 578.20610137,  499.17559452,  328.61830411]])
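The question also asks for the per-category proportions that R's prop.test reports. A minimal sketch, assuming the same table layout as above (first row = voted, second row = did not vote; columns = Right, Left, Center), computes them directly with NumPy. It reproduces only the proportions, not the rest of prop.test's output.

```python
import numpy as np
from scipy.stats import chi2_contingency

# Rows: voted (yes), did not vote (no). Columns: Right, Left, Center.
tbl = np.array([[1500, 826, 431],
                [212, 652, 542]])

# Share of people who voted within each political category:
# the "yes" row divided by the column totals.
proportions = tbl[0] / tbl.sum(axis=0)
print(proportions)

# correction=False plays the role of correct = FALSE in R. SciPy only
# applies the continuity correction when there is 1 degree of freedom,
# so it makes no difference for this 2x3 table.
stat, p, dof, expected = chi2_contingency(tbl, correction=False)
print(stat, dof, p)
```

The printed proportions should match the prop 1, prop 2 and prop 3 values in the question, and the statistic should match the X-squared value reported by R.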

edited Jan 22 '22 at 01:09

answered Jan 22 '22 at 01:01

Warren Weckesser

thanks Warren. I am getting the error: `TypeError: '<' not supported between instances of 'str' and 'int'` . May I ask you how I could get the frequency values (just an example) grouped as shown in tbl using the sample of data? thanks – LdM Jan 22 '22 at 02:12
Is the raw data in a file formatted like you show at the beginning of the question? – Warren Weckesser Jan 22 '22 at 03:14
yes, it is. But there might be NaN values in Political category. – LdM Jan 22 '22 at 12:03
even after fixing them I am getting the error. I have opened a new question for this issue – LdM Jan 23 '22 at 03:53
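For the follow-up in the comments (building the frequency table from the raw rows), one possible approach is pandas.crosstab. This is a sketch under assumptions: the column names are taken from the question, the first six rows are the question's sample, and the last two rows (including the missing category) are hypothetical additions for illustration. It does not diagnose the TypeError mentioned above, whose cause the thread does not establish.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# First six rows come from the question's sample; the last two are
# hypothetical, added so every row and column has a nonzero total and
# to show how a missing category is handled.
df = pd.DataFrame({
    "Voted?": [1, 0, 1, 1, 1, 1, 0, 0],
    "Political Category": ["Right", "Left", "Center", "Right",
                           "Right", "Right", "Center", None],
})

# Drop rows with a missing value in either column before counting.
clean = df.dropna(subset=["Voted?", "Political Category"])

# Contingency table: rows = Voted? (0/1), columns = political category.
counts = pd.crosstab(clean["Voted?"], clean["Political Category"])
print(counts)

# Pass the raw counts to the chi-squared test.
stat, p, dof, expected = chi2_contingency(counts.to_numpy())
print(stat, dof, p)
```

With a sample this small the test result is not meaningful; the point is only the shape of the table that chi2_contingency expects.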
