Correlation analysis with multiple data in a single cell

Question

I have a dataset with some rows containing singular answers and others having multiple answers. Like so:

       year  length  Animation
0      1971     121       1,2,3
1      1939      71       1,3
2      1941       7       0,2
3      1996      70       1,2,0
4      1975      71       3,2,0

With the singular answers I managed to create a heatmap using df.corr(), but I can't figure out what is the best approach for multiple answers rows.

I could split them and add additional columns for each answer like:

           year  length  Animation
    0      1971     121       1
    1      1971     121       2
    2      1971     121       3
    3      1939      71       1
    4      1939      71       3 ...

and then do the exact same dr.corr(), or add additional Animation_01, Animation_02 ... columns, but there must be a smarter way to work around this issue?

EDIT: Actual data snippet

What is your y variable for corr(x, y), assuming x = Animation? — Bill Huang, Apr 04 '21 at 22:53
The data listed is just example data. But could be corr(Animation,year) or corr(Animation,length). Could post snippet of actual data if that helps. But my question is basically how to use corr() with rows that contain multiple rows. Not sure how X affects that. — Rainoa, Apr 04 '21 at 22:59
Added a snippet of the actual data at the bottom. My example would be how to find correlation between "Current Rank" and "Preferred_positions" or "Current Rank" and "Things_you_struggle_most_with". — Rainoa, Apr 04 '21 at 23:09
I asked so because it makes no sense to put a categorical variable into `df.corr()`. All three methods of correlation in the [documentation](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.corr.html) are for numerical variables. I don't understand what are you trying to achieve mathematically. — Bill Huang, Apr 05 '21 at 13:59
Ah okay, feel free to correct me if I am wrong. But casting strings into ints to use correlation, isn't mathematically incorrect right? How else would one try to find correlation between 2 strings, but not correlation with the string value? Like is Y value more or less likely because of X value. — Rainoa, Apr 05 '21 at 14:52

score 4 · Accepted Answer · answered Apr 05 '21 at 18:13

You should compute a frequency table between two categorical variables using pd.crosstab() and perform subsequent analyses based on this table. df.corr(x, y) is NOT mathematically meaningful when one of x and y is categorical, no matter it is encoded into number or not.

N.B.1 If x is categorical but y is numerical, there are two options to describe the linkage between them:

Group y into quantiles (bins) and treat it as categorical
Perform a linear regression of y against one-hot encoded dummy variables of x

Option 2 is more precise in general but the statistics is beyond the scope of this question. This post will focus on the case of two categorical variables.

N.B.2 For sparse matrix output please see this post.

Sample Solution

Data & Preprocessing

import pandas as pd
import io
import matplotlib.pyplot as plt
from seaborn import heatmap

df = pd.read_csv(io.StringIO("""
       year  length  Animation
0      1971     121       1,2,3
1      1939      71       1,3
2      1941       7       0,2
3      1996      70       1,2,0
4      1975      71       3,2,0
"""), sep=r"\s{2,}", engine="python")

# convert string to list
df["Animation"] = df["Animation"].str.split(',')
# expand list column into new rows
df = df.explode("Animation")
# (optional)
df["Animation"] = df["Animation"].astype(int)

Frequency Table

Note: grouping of length is ignored for simplicity

ct = pd.crosstab(df["Animation"], df["length"])

print(ct)
# Out[65]:
# length     7    70   71   121
# Animation
# 0            1    1    1    0
# 1            0    1    1    1
# 2            1    1    1    1
# 3            0    0    2    1

Visualization

ax = heatmap(ct, cmap="viridis",
             yticklabels=df["Animation"].drop_duplicates().sort_values(),
             xticklabels=df["length"].drop_duplicates().sort_values(),
             )
ax.set_title("Title", fontsize=20)
plt.show()

Example Analysis

Based on the frequency table, you can ask questions about the distribution of y given a certain (subset of) x value(s), or vice versa. This should better describe the linkage between two categorical variables, as the categorical variables have no order.

For example,

Q: What length does Animation=3 produces? 

A: 66.7% chance to give 71
   33.3% chance to give 121
   otherwise unobserved

Thanks for the help! I can see now that I used corr() wrong. You explained it all well with helpful links. Marked you as correct answer :) — Rainoa, Apr 05 '21 at 19:08

score 1 · Answer 2 · answered Apr 05 '21 at 01:34

You want to break Animation (or Preferred_positions in your data snippet) up into a series of one-hot columns, one one-hot column for every unique string in the original column. Every column with have values of either zero or one, one corresponding to rows where that string appeared in the original column.

First, you need to get all the unique substrings in Preferred_positions (see this answer for how to deal with a column of lists).

positions = df.Preferred_positions.str.split(',').sum().unique()

Then you can create the positions columns in a loop based on whether the given position is in Preferred_positions for each row.

for position in positions:
    df[position] = df.Preferred_positions.apply(
        lambda x: 1 if position in x else 0
    )