How to plot correlation matrix/heatmap with categorical and numerical variables

Question

I have 4 variables, of which 2 are nominal (dtype=object) and 2 are numeric (dtypes int and float).

df.head(1)

OUT:
OS_type|Week_day|clicks|avg_app_speed
iOS|Monday|400|3.4

Now, I want to throw the dataframe into a seaborn heatmap visualization.

import numpy as np
import seaborn as sns
ax = sns.heatmap(df)

But I get an error indicating I cannot use categorical variables, only numbers. How do I process this correctly and then feed it back into the heatmap?

You can try to define your categorical columns as binary data and then apply the correlation matrix. [Related topic](https://stackoverflow.com/questions/44694228/how-to-check-for-correlation-among-continuous-and-categorical-variables-in-pytho) — Alexandre B., Jul 30 '19 at 21:52
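A minimal sketch of the binary-encoding idea from the comment above. The rows are made up to match the shape of the question's dataframe, and `pd.get_dummies` is one possible way to produce the binary columns, not necessarily what the commenter had in mind.

```python
import pandas as pd

# Made-up rows shaped like the question's dataframe (illustration only)
df = pd.DataFrame({
    "OS_type": ["iOS", "Android", "iOS", "Android"],
    "Week_day": ["Monday", "Monday", "Tuesday", "Tuesday"],
    "clicks": [400, 250, 380, 270],
    "avg_app_speed": [3.4, 2.9, 3.1, 3.0],
})

# Turn each category level into its own 0/1 column; numeric columns pass through
dummies = pd.get_dummies(df, columns=["OS_type", "Week_day"], dtype=float)

# Pearson correlation between every pair of (now numeric) columns
corr = dummies.corr()
print(corr.round(2))
```

The resulting `corr` frame is purely numeric, so it can be passed to `sns.heatmap(corr, annot=True, vmin=-1, vmax=1)`. Note that the dummy columns of one variable are strongly negatively correlated with each other by construction.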

score 0 · Answer 1 · answered Apr 16 '22 at 03:38

The heatmap needs a purely numeric matrix, so first compute a pairwise association measure for every pair of columns. For two numerical variables you can use Pearson's r (range -1 to 1), for two categorical variables the bias-corrected Cramér's V (range 0 to 1), and for a categorical and a numerical variable the correlation ratio (range 0 to 1).
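The three measures can be combined into one matrix roughly as follows. This is a sketch under assumptions: the helper names (`cramers_v`, `correlation_ratio`, `association_matrix`, `plot_association_heatmap`) and the sample rows are made up for illustration, and Cramér's V uses the Bergsma bias correction computed by hand with numpy rather than a library routine.

```python
import numpy as np
import pandas as pd


def cramers_v(x, y):
    """Bias-corrected Cramér's V between two categorical Series (0 to 1)."""
    table = pd.crosstab(np.asarray(x), np.asarray(y)).to_numpy(dtype=float)
    n = table.sum()
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    r, k = table.shape
    phi2 = max(0.0, chi2 / n - (k - 1) * (r - 1) / (n - 1))
    r_corr = r - (r - 1) ** 2 / (n - 1)
    k_corr = k - (k - 1) ** 2 / (n - 1)
    denom = min(r_corr - 1, k_corr - 1)
    return float(np.sqrt(phi2 / denom)) if denom > 0 else 0.0


def correlation_ratio(categories, values):
    """Correlation ratio (eta) between a categorical and a numeric Series (0 to 1)."""
    values = values.astype(float)
    group_means = values.groupby(categories).transform("mean")
    ss_between = ((group_means - values.mean()) ** 2).sum()
    ss_total = ((values - values.mean()) ** 2).sum()
    return float(np.sqrt(ss_between / ss_total)) if ss_total > 0 else 0.0


def association_matrix(df):
    """Pick the measure for each column pair based on the two dtypes."""
    cols = list(df.columns)
    is_num = {c: pd.api.types.is_numeric_dtype(df[c]) for c in cols}
    out = pd.DataFrame(np.eye(len(cols)), index=cols, columns=cols)
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:
            if is_num[a] and is_num[b]:
                v = df[a].corr(df[b])  # Pearson's r
            elif not is_num[a] and not is_num[b]:
                v = cramers_v(df[a], df[b])
            else:
                cat, num = (b, a) if is_num[a] else (a, b)
                v = correlation_ratio(df[cat], df[num])
            out.loc[a, b] = out.loc[b, a] = v
    return out


def plot_association_heatmap(matrix):
    """Draw the matrix; seaborn is imported here so the rest runs without it."""
    import seaborn as sns
    return sns.heatmap(matrix, annot=True, vmin=-1, vmax=1, cmap="coolwarm")


# Made-up rows shaped like the question's dataframe (illustration only)
df = pd.DataFrame({
    "OS_type": ["iOS", "Android", "iOS", "Android", "iOS", "Android", "iOS", "Android"],
    "Week_day": ["Monday", "Monday", "Tuesday", "Tuesday",
                 "Monday", "Tuesday", "Monday", "Tuesday"],
    "clicks": [400, 250, 380, 270, 410, 240, 395, 260],
    "avg_app_speed": [3.4, 2.9, 3.1, 3.0, 3.5, 2.8, 3.3, 2.7],
})

assoc = association_matrix(df)
print(assoc.round(2))
```

Calling `plot_association_heatmap(assoc)` then draws the heatmap. Keep in mind that the cells are not all on the same scale: only the numeric-numeric cells can be negative, so the colors of different cell types are comparable only in magnitude.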

As for creating numerical representations of categorical variables, there are a number of ways to do that:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('some_source.csv')  # has categorical var 'categ_var'

# method 1: uses pandas
df['numerized1'] = df['categ_var'].astype('category').cat.codes

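# method 2: uses pandas; codes rank the values by descending frequency (0 = most frequent)
# (the ranking is computed once instead of calling value_counts() for every row)
freq_rank = {value: rank for rank, value in enumerate(df['categ_var'].value_counts().index)}
df['numerized2'] = df['categ_var'].map(freq_rank)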

# method 3: uses sklearn, result is the same as method 1
lbl = LabelEncoder()
df['numerized3'] = lbl.fit_transform(df['categ_var'])

# method 4: uses pandas; codes follow order of first appearance, xyz holds the unique values
df['numerized4'], xyz = pd.factorize(df['categ_var'])
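To see how the pandas-based encodings differ, here is a small sketch on a made-up column (the values and variable names are illustrative only). The integer codes depend entirely on the ordering rule, which is why they are labels rather than quantities: for a nominal variable, a Pearson correlation computed on such codes changes with the encoding.

```python
import pandas as pd

# Made-up column for illustration
s = pd.Series(["Android", "iOS", "iOS", "Windows", "iOS", "Android"], name="categ_var")

# method 1: codes follow the sorted category order (Android, Windows, iOS)
codes_sorted = s.astype("category").cat.codes.tolist()

# method 2: codes follow descending frequency (iOS, Android, Windows)
freq_rank = {value: rank for rank, value in enumerate(s.value_counts().index)}
codes_by_freq = s.map(freq_rank).tolist()

# method 4: codes follow order of first appearance (Android, iOS, Windows)
codes_by_appearance, uniques = pd.factorize(s)

print(codes_sorted)                  # [0, 2, 2, 1, 2, 0]
print(codes_by_freq)                 # [1, 0, 0, 2, 0, 1]
print(codes_by_appearance.tolist())  # [0, 1, 1, 2, 1, 0]
```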
