Correlation among multiple categorical variables

Question

I have a data set made of 22 categorical variables (non-ordered). I would like to visualize their correlation in a nice heatmap. Since the Pandas built-in function

DataFrame.corr(method='pearson', min_periods=1)

only implements correlation coefficients for numerical variables (Pearson, Kendall, Spearman), I have to aggregate the data myself to perform a chi-square test or something like it, and I am not sure which function to use to do it in one elegant step (rather than iterating through all the cat1*cat2 pairs). To be clear, this is what I would like to end up with (a dataframe):

         cat1  cat2  cat3  
  cat1|  coef  coef  coef  
  cat2|  coef  coef  coef
  cat3|  coef  coef  coef

Any ideas with pd.pivot_table or something in the same vein?

score 31 · Accepted Answer · edited May 22 '21 at 04:10

You can use pd.factorize:

df.apply(lambda x : pd.factorize(x)[0]).corr(method='pearson', min_periods=1)
Out[32]: 
     a    c    d
a  1.0  1.0  1.0
c  1.0  1.0  1.0
d  1.0  1.0  1.0

Data input

df=pd.DataFrame({'a':['a','b','c'],'c':['a','b','c'],'d':['a','b','c']})

Update

from scipy.stats import chisquare

df=df.apply(lambda x : pd.factorize(x)[0])+1

pd.DataFrame([chisquare(df[x].values,f_exp=df.values.T,axis=1)[0] for x in df])

Out[123]: 
     0    1    2    3
0  0.0  0.0  0.0  0.0
1  0.0  0.0  0.0  0.0
2  0.0  0.0  0.0  0.0
3  0.0  0.0  0.0  0.0

df=pd.DataFrame({'a':['a','d','c'],'c':['a','b','c'],'d':['a','b','c'],'e':['a','b','c']})
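
One caveat raised in the comments is that Pearson's r is meant for numerical data. A minimal sketch of why that matters here: pd.factorize numbers the categories in order of first appearance, so the integer codes are arbitrary, and the resulting coefficient can change when the same rows are merely reordered. The helper name `factorized_corr` and the toy data are assumptions made up for illustration, not part of the original answer.

```python
import pandas as pd

def factorized_corr(frame):
    """Pearson correlation of the integer codes produced by pd.factorize."""
    return frame.apply(lambda col: pd.factorize(col)[0]).corr(method="pearson")

# Toy data, invented purely for illustration.
toy = pd.DataFrame({"a": ["x", "y", "z", "z"], "b": ["u", "v", "u", "v"]})

# Same rows in a different order: factorize assigns codes by first appearance,
# so the codes (and therefore the coefficient) are different.
reordered = toy.iloc[[1, 2, 0, 3]].reset_index(drop=True)

r_original = factorized_corr(toy).loc["a", "b"]
r_reordered = factorized_corr(reordered).loc["a", "b"]
print(r_original, r_reordered)
```

A statistic computed from the contingency table of the two columns does not depend on row order or on how the categories happen to be numbered.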

edited May 22 '21 at 04:10

Trenton McKinney


answered Dec 30 '17 at 15:49

BENY


1

sounds like a good plan but, from what I understood, I can't use pearson on categorical data. Would it be possible to modify this code to end up with chi-squared? – zar3bski Dec 30 '17 at 16:05
1

@DavidZarebski https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chisquare.html – BENY Dec 30 '17 at 16:08
I saw it but I end up with a (8124, 22) matrix instead of the (22,22) I am looking for. (I have 8124 observation). If you see what I mean – zar3bski Dec 30 '17 at 17:03
@DavidZarebski you can check this one : -) https://codereview.stackexchange.com/questions/96761/chi-square-independence-test-for-two-pandas-df-columns – BENY Dec 30 '17 at 17:22
@DavidZarebski isn't the Pearson test's full name the _Pearson's chi-squared test_ ? I think you might be overly complicating things by avoiding it. – roberto tomás Jul 12 '18 at 13:20
2

@robertotomás There is something called the Pearson's chi-squared test (which leads to some confusions sometimes (https://en.wikipedia.org/wiki/Pearson%27s_chi-squared_test)) and yes, it is intended to measure the correlation between categorical variables (like a regular Chi2). However, it seems to me that it differs from the so called Pearson correlation (resp. Kendall, Spearman) (see (https://en.wikipedia.org/wiki/Pearson_correlation_coefficient)) intended to apply to numerical variables. Calling the .corr(method='pearson') method in pandas involves the latter. – zar3bski Jul 12 '18 at 13:33

zar3bski · Answer 2 · 2020-04-24T16:55:11.383

3

It turns out that the only solution I found is to iterate through all the factor*factor pairs.

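import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# every ordered (factor, factor) pair, including each factor paired with itself
factors_paired = [(i, j) for i in df.columns.values for j in df.columns.values]

chi2, p_values = [], []

for f in factors_paired:
    if f[0] != f[1]:
        chitest = chi2_contingency(pd.crosstab(df[f[0]], df[f[1]]))
        chi2.append(chitest[0])      # test statistic
        p_values.append(chitest[1])  # p-value
    else:      # for same factor pair
        chi2.append(0)
        p_values.append(0)

n = len(df.columns)  # number of factors (a hard-coded 23 only fits a 23-column frame)
chi2 = np.array(chi2).reshape((n, n)) # shape it as a matrix
chi2 = pd.DataFrame(chi2, index=df.columns.values, columns=df.columns.values) # then a df for convenience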
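
The raw chi-square statistic grows with the number of observations and the table size, so it is hard to compare across pairs in a heatmap. One common rescaling is Cramér's V, V = sqrt(chi2 / (n * (min(rows, cols) - 1))), which lies between 0 and 1. Below is a rough sketch of that idea; the helper `cramers_v_matrix` and the toy frame are hypothetical, the diagonal is simply set to 1, and `correction=False` is my choice so that 2x2 tables are not Yates-corrected.

```python
import itertools

import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v_matrix(df):
    """Pairwise Cramér's V for all columns of df (hypothetical helper)."""
    cols = list(df.columns)
    # Start from the identity: a variable is treated as perfectly
    # associated with itself.
    out = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
    # V is symmetric, so each unordered pair is computed once.
    for a, b in itertools.combinations(cols, 2):
        table = pd.crosstab(df[a], df[b])
        chi2_stat = chi2_contingency(table, correction=False)[0]
        n_obs = table.to_numpy().sum()
        k = min(table.shape) - 1
        # A column with a single level has no defined association.
        v = np.sqrt(chi2_stat / (n_obs * k)) if k > 0 else np.nan
        out.loc[a, b] = out.loc[b, a] = v
    return out

# Toy data, invented for illustration: 'a' and 'b' determine each other.
toy = pd.DataFrame({
    "a": list("xyxyxy"),
    "b": list("uvuvuv"),
    "c": list("ppqqpq"),
})
v_matrix = cramers_v_matrix(toy)
print(v_matrix)
```

The result is already a square DataFrame indexed by column name, which is the shape asked for in the question.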

edited Apr 24 '20 at 16:55

answered Dec 31 '17 at 15:20

zar3bski


zar3bski Where is Chitest defined here? – ahlusar1989 Apr 24 '20 at 15:58
1

[here](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html) I would say (it's been a while). I must have included something like `from scipy.stats import chi2_contingency` in the beginning of the script (which I do not have anymore) – zar3bski Apr 24 '20 at 16:04
why do you use chitest – Maths12 Feb 04 '21 at 18:24
why did you use 23, 23 to reshape the array, is it because OP has mentioned he has 22 categorical columns? – Sidrah Madiha Siddiqui Apr 19 '21 at 04:00

score 2 · Answer 3 · answered Jul 18 '22 at 19:52

With the association-metrics Python package, calculating a matrix of Cramér's V coefficients from a pandas.DataFrame is quite simple:

First install association_metrics using:

pip install association-metrics

Then you can use the following code:

# Import association_metrics
import association_metrics as am
# Convert your str columns to Category columns
df = df.apply(
        lambda x: x.astype("category") if x.dtype == "O" else x)

# Initialize a CramersV object using your pandas.DataFrame
cramersv = am.CramersV(df)
# will return a pairwise matrix filled with Cramer's V, where columns and index are
# the categorical variables of the passed pandas.DataFrame
cramersv.fit()
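
Since the question asks for a heatmap, here is a minimal sketch of plotting such a pairwise matrix with matplotlib. It assumes the matrix is a square pandas.DataFrame with values between 0 and 1; the function name `plot_association_heatmap` and the toy values are made up for illustration and are not part of association-metrics.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

def plot_association_heatmap(matrix, path=None):
    """Draw a square association matrix as a heatmap (hypothetical helper)."""
    fig, ax = plt.subplots(figsize=(6, 5))
    # Fix the colour scale to [0, 1], the range of Cramér's V.
    image = ax.imshow(matrix.to_numpy(dtype=float), vmin=0.0, vmax=1.0, cmap="viridis")
    ax.set_xticks(range(len(matrix.columns)))
    ax.set_xticklabels(matrix.columns, rotation=90)
    ax.set_yticks(range(len(matrix.index)))
    ax.set_yticklabels(matrix.index)
    fig.colorbar(image, ax=ax, label="Cramér's V")
    fig.tight_layout()
    if path is not None:
        fig.savefig(path)
    return fig, ax

# Made-up values, only to exercise the plotting code.
toy_matrix = pd.DataFrame(
    [[1.0, 0.4, 0.1], [0.4, 1.0, 0.7], [0.1, 0.7, 1.0]],
    index=["cat1", "cat2", "cat3"],
    columns=["cat1", "cat2", "cat3"],
)
fig, ax = plot_association_heatmap(toy_matrix)
```

In practice you would pass the DataFrame returned by `cramersv.fit()` instead of the toy matrix.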
