I have data where I'm modeling a binary dependent variable. There are 5 categorical predictor variables, and I have run a chi-square test of independence for each of them against the dependent variable. All came up with very low p-values.

Now, I'd like to create a chart that displays all of the differences between the observed and expected counts. It seems like this should be part of the scipy chi2_contingency function but I can't figure it out.

The only thing I can think of is that the chi2_contingency function will output an array of expected counts, so I guess I need to figure out how to convert my cross tab table of observed counts into an array and then subtract the two.

  ## Gender & Income:  cross-tabulation table and chi-square

  ct_sex_income=pd.crosstab(adult_df.sex, adult_df.income, margins=True)
  ct_sex_income

  ## Run Chi-Square test

  scipy.stats.chi2_contingency(ct_sex_income)

  ## try to subtract them

  ct_sex_income.observed - chi2_contingency(ct_sex_income)[4]

The error I get is "AttributeError: 'DataFrame' object has no attribute 'observed'"

I'd like just an array that shows the differences.

TIA for any help

  • `ct_sex_income['observed'] - chi2_contingency(ct_sex_income)[4]` ? – Celius Stingher Oct 11 '19 at 17:56
  • Thanks for quick response. Unfortunately, that throws the following error: KeyError: 'observed' – immaprogrammingnoob Oct 11 '19 at 18:02
  • The main df is "adult_df." For this particular comparison, I used pandas crosstab() to create the cross tabulation table from adult_df.sex and adult_df.income. The latter variable is not a numerical variable but a categorical variable (<$50K, and >$50K). So no, "ct_sex_income" is not a column. Hope this helps – immaprogrammingnoob Oct 11 '19 at 18:12

1 Answer

I don't know your data, and I can't tell how `observed` is defined in your code. I'm also not sure I fully understand your goal; I assume it is something like predicting people's income from a categorical variable such as their marital status.

I am posting here one possible solution for your problem.

        import pandas as pd
        import scipy.stats as stats

        # some bogus data
        data = [['single','30k-35k'],['divorced','40k-45k'],['married','25k-30k'],
                ['single','25k-30k'],['married','40k-45k'],['divorced','40k-35k'],
                ['single','30k-35k'],['married','30k-35k'],['divorced','30k-35k'],
                ['single','30k-35k'],['married','40k-45k'],['divorced','25k-30k'],
                ['single','40k-45k'],['married','30k-35k'],['divorced','30k-35k'],
                ]

        adult_df = pd.DataFrame(data,columns=['marital','income'])

        X = adult_df['marital']  # categorical predictor
        Y = adult_df['income']   # outcome being modeled

        # observed counts: income levels as rows, marital status as columns
        dfObserved = pd.crosstab(Y, X)

        # chi2_contingency returns the chi-square statistic, the p-value,
        # the degrees of freedom and the array of expected frequencies
        chi2, pv, free, efreq = stats.chi2_contingency(dfObserved.values)

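        # give the expected array the same labels as the observed table
        # so the two DataFrames line up when subtracted
        dfExpected = pd.DataFrame(efreq, columns=dfObserved.columns, index=dfObserved.index)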

        print(dfExpected)
        """
        marital  divorced   married    single
        income                               
        25k-30k  1.000000  1.000000  1.000000
        30k-35k  2.333333  2.333333  2.333333
        40k-35k  0.333333  0.333333  0.333333
        40k-45k  1.333333  1.333333  1.333333
        """

        print(dfObserved)
        """ 
        marital  divorced  married  single
        income                            
        25k-30k         1        1       1
        30k-35k         2        2       3
        40k-35k         1        0       0
        40k-45k         1        2       1
        """

        # observed minus expected count for every cell
        difference = dfObserved - dfExpected
        print(difference)
        """
        marital  divorced   married    single
        income                               
        25k-30k  0.000000  0.000000  0.000000
        30k-35k -0.333333 -0.333333  0.666667
        40k-35k  0.666667 -0.333333 -0.333333
        40k-45k -0.333333  0.666667 -0.333333
        """

I hope it helps
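Applied to the sex/income table from the question, the same idea might look like the sketch below. Two details of the original attempt are worth noting: `margins=True` adds an "All" row and column that `chi2_contingency` would treat as ordinary cells, and the function returns only four values, so the expected frequencies sit at index 3, not 4. The helper name `observed_minus_expected` and the `toy_df` data are made up for illustration; they stand in for `adult_df`.

```python
import pandas as pd
from scipy.stats import chi2_contingency


def observed_minus_expected(predictor, target):
    """Return the observed, expected and (observed - expected) tables."""
    # No margins=True: the "All" totals would be treated as data by the test
    observed = pd.crosstab(predictor, target)
    # Index 3 of the result holds the array of expected frequencies
    expected_values = chi2_contingency(observed.values)[3]
    # Re-attach the row/column labels so the subtraction aligns cell by cell
    expected = pd.DataFrame(expected_values,
                            index=observed.index,
                            columns=observed.columns)
    return observed, expected, observed - expected


# Hypothetical stand-in for adult_df
toy_df = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Male",
            "Female", "Female", "Female", "Female"],
    "income": ["<=50K", ">50K", ">50K", ">50K",
               "<=50K", "<=50K", "<=50K", ">50K"],
})

observed, expected, difference = observed_minus_expected(toy_df["sex"], toy_df["income"])
print(difference)
```

Each row and each column of `difference` sums to zero, which is a quick sanity check that the margins were left out.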

powerPixie
  • Thanks for this. It was very helpful. The model is more than just marital status to predict income level, but I needed to compare each independent categorical variable to income level. It looks like the key portion of the code was the following line, which allowed me to subtract data frames: dfExpected = pd.DataFrame(efreq, columns=dfObserved.columns, index = dfObserved.index) – immaprogrammingnoob Oct 14 '19 at 14:12
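
For comparing several categorical predictors against income, as described in the comment above, a loop along the following lines might be enough. This is only a sketch: `toy_df` and its column names are invented stand-ins for `adult_df`. It also relies on the fact that pandas can subtract a NumPy array of the same shape from a DataFrame directly, keeping the DataFrame's labels.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical stand-in for adult_df with two categorical predictors
toy_df = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Male",
            "Female", "Female", "Female", "Female"],
    "marital": ["single", "married", "married", "divorced",
                "single", "divorced", "married", "single"],
    "income": ["<=50K", ">50K", ">50K", ">50K",
               "<=50K", "<=50K", "<=50K", ">50K"],
})

differences = {}
for col in ["sex", "marital"]:
    # Observed counts without margins, one table per predictor
    observed = pd.crosstab(toy_df[col], toy_df["income"])
    # Expected frequencies come back as a plain NumPy array
    expected = chi2_contingency(observed.values)[3]
    # Same-shape array is subtracted element-wise; labels come from observed
    differences[col] = observed - expected

for col, table in differences.items():
    print(col)
    print(table)
```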