I was just looking at https://en.wikipedia.org/wiki/Chi-squared_test and wanted to recreate the example "Example chi-squared test for categorical data".
I feel that the approach I've taken might have room for improvement, so was wondering how that might be done.
Here's the code:
csv = """\
,A,B,C,D
White collar,90,60,104,95
Blue collar,30,50,51,20
No collar,30,40,45,35
"""
observed_workers = pd.read_csv(io.StringIO(csv), index_col=0)
col_sums = dt.apply(sum)
row_sums = dt.apply(sum, axis=1)
l = list(x[1] * (x[0] / col_sums.sum()) for x in itertools.product(row_sums, col_sums))
expected_workers = pd.DataFrame(
np.array(l).reshape((3, 4)),
columns=observed_workers.columns,
index=observed_workers.index,
)
chi_squared_stat = (
((observed_workers - expected_workers) ** 2).div(expected_workers).sum().sum()
)
This returns the correct value, but is probably ignorant of a nicer approach using some particular numpy / pandas methods.