Can someone help me with scipy.stats.chisquare? I do not have a statistical / mathematical background, and I am learning scipy.stats.chisquare with this data set from https://en.wikipedia.org/wiki/Chi-squared_test

The Wikipedia article gives the table below as an example, stating the Chi-squared value based on it is approximately 24.6. I am to use scipy.stats to verify this value and calculate the associated p value.

[Image: table from the Wikipedia article]

I found the documentation page that looks most relevant to this:

https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html

As I am new to both statistics and scipy.stats.chisquare, I am not sure of the best approach: how should I enter the data from the table into arrays, and do I need to supply the expected values from Wikipedia?

Christopher

1 Answer

That data is a contingency table. SciPy has the function scipy.stats.chi2_contingency, which applies the chi-square test to a contingency table. It is fundamentally just a regular chi-square test, but when applied to a contingency table, the expected frequencies are calculated under the assumption of independence (chi2_contingency does this for you), and the degrees of freedom depend on the number of rows and columns (chi2_contingency calculates this for you, too).

Here's how you can apply the chi-square test to that table:

import numpy as np
from scipy.stats import chi2_contingency


table = np.array([[90, 60, 104, 95],
                  [30, 50,  51, 20],
                  [30, 40,  45, 35]])

chi2, p, dof, expected = chi2_contingency(table)

print(f"chi2 statistic:     {chi2:.5g}")
print(f"p-value:            {p:.5g}")
print(f"degrees of freedom: {dof}")
print("expected frequencies:")
print(expected)

Output:

chi2 statistic:     24.571
p-value:            0.00040984
degrees of freedom: 6
expected frequencies:
[[ 80.53846154  80.53846154 107.38461538  80.53846154]
 [ 34.84615385  34.84615385  46.46153846  34.84615385]
 [ 34.61538462  34.61538462  46.15384615  34.61538462]]
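
To connect this back to scipy.stats.chisquare, here is a rough sketch of the same calculation done by hand. It assumes the usual independence formula (row total × column total / grand total) for the expected frequencies and (rows - 1) × (columns - 1) for the degrees of freedom; the variable names are just illustrative. The ddof value passed to chisquare is chosen so that its default of k - 1 degrees of freedom (k = 12 cells) is reduced to 6.

```python
import numpy as np
from scipy.stats import chi2, chisquare

table = np.array([[90, 60, 104, 95],
                  [30, 50,  51, 20],
                  [30, 40,  45, 35]])

# Expected frequencies under independence:
# (row total * column total) / grand total, for every cell.
row_totals = table.sum(axis=1, keepdims=True)   # shape (3, 1)
col_totals = table.sum(axis=0, keepdims=True)   # shape (1, 4)
n = table.sum()
expected = row_totals * col_totals / n

# Degrees of freedom for an r x c contingency table.
dof = (table.shape[0] - 1) * (table.shape[1] - 1)

# Pearson chi-square statistic and its upper-tail p-value.
stat = ((table - expected) ** 2 / expected).sum()
p = chi2.sf(stat, dof)
print(f"by hand:   chi2 = {stat:.5g}, p = {p:.5g}, dof = {dof}")

# The same test via chisquare on the flattened table. chisquare uses
# k - 1 - ddof degrees of freedom, so ddof must remove the difference.
k = table.size
result = chisquare(table.ravel(), f_exp=expected.ravel(), ddof=k - 1 - dof)
print(f"chisquare: chi2 = {result.statistic:.5g}, p = {result.pvalue:.5g}")
```

Both approaches should agree with the chi2_contingency output above. In practice chi2_contingency is the more convenient choice for a table like this, since it works out the expected frequencies and degrees of freedom itself.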
Warren Weckesser
  • Thank you so much Warren Weckesser, that is extremely helpful. I am curious about the `g` you used in the print formatting; I am trying to look it up by name. – Christopher Nov 03 '20 at 20:31
  • That's the "general format" for formatting floating point values. Check out ["Format Specification Mini-Language"](https://docs.python.org/3/library/string.html#format-specification-mini-language); scroll down to the table of codes that follows the sentence "The available presentation types for floating point and decimal values are:". – Warren Weckesser Nov 03 '20 at 20:41
  • When should each of the following chi-squared functions be used? 1. `from scipy.stats import chi2_contingency` 2. `from scipy.stats import chisquare` 3. `from sklearn.feature_selection import SelectKBest` with `from sklearn.feature_selection import chi2` – Anuganti Suresh Jul 26 '21 at 04:18
  • That's a really low `p-value`, and given the discrepancy between expected and actual frequencies it cannot be the typical _significance testing_ `p-value`. So what does it mean here? – WestCoastProjects Oct 17 '21 at 16:57
  • @WestCoastProjects, it is the p-value computed using the chi-square test. What is the problem with the value? – Warren Weckesser Oct 17 '21 at 17:22
  • @WarrenWeckesser Rephrasing my comment: the actual frequencies are highly divergent from the expected ones, so how can we have a super low p-value here? Typically a p-value of 0.01 reflects a high correspondence to the null hypothesis; here we have a 25X _better_ (lower) p-value, yet the expected/actuals are dissimilar. I am reading up more on chi-square to understand what their p-values actually mean here. – WestCoastProjects Oct 17 '21 at 18:09
  • You have it backwards. A low p-value is evidence *against* the null hypothesis. In this case, the null hypothesis is that there is no association among the categories. A large discrepancy of the frequencies from the expected frequencies suggests that there *is* some sort of association or dependence. Roughly: large discrepancy ⇒ large chi-square statistic ⇒ small p-value. – Warren Weckesser Oct 17 '21 at 19:10
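
The "large statistic ⇒ small p-value" relationship can be checked numerically. This is a minimal sketch, assuming 6 degrees of freedom as for the 3 × 4 table in the answer; apart from 24.571, the statistic values are arbitrary choices for illustration.

```python
from scipy.stats import chi2

dof = 6  # (3 - 1) * (4 - 1) for a 3 x 4 contingency table

# Illustrative statistic values; only 24.571 comes from the answer.
stats = [2.0, 6.0, 12.0, 24.571]

# chi2.sf is the survival function: the probability of seeing a
# statistic at least this large if the null hypothesis were true.
p_values = [chi2.sf(s, dof) for s in stats]

for s, p in zip(stats, p_values):
    print(f"chi2 = {s:7.3f}  ->  p = {p:.5g}")
```

As the statistic grows, the printed p-value shrinks, which is why a large discrepancy between observed and expected frequencies counts as evidence against independence.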