
I have a diagonal correlation matrix produced in seaborn. I would like to mask out the ones that have a p-value greater than 0.05.

Here's what I've got https://i.stack.imgur.com/16Rky.jpg

sns.set(style="white")
corr = result.corr()
print(corr)

mask = np.zeros_like(corr, dtype=bool)
mask[np.triu_indices_from(mask)] = True
f, ax = plt.subplots(figsize=(11, 9))
sns_plot = sns.heatmap(result.corr(),mask=mask, annot=True, center=0, square=True, fmt=".1f", linewidths=.5, cmap="Greens")

Would greatly appreciate any help with this. Many thanks

Richard Summers
  • if `a = result.corr()`, then `heatmap(a, mask = mask & (a > 0.05))` ? – ImportanceOfBeingErnest Jul 26 '19 at 19:45
  • Hi ImportanceOfBeingErnest.... It's not quite what I was looking for but it's actually better! So... I'd already masked the top half of the triangle (so only to display the bottom half of the triangle. What your suggestion does, is displays all of the correlations in one half of the triangle, and only p<0.05 correlations in the other! https://imgur.com/a/YziKmGT Just one amendment that I would make to what you said is (a < 0.05)... Note that I have changed mask[np.triu_indices_from(mask)] to mask[np.tril_indices_from(mask)] (mask applied to bottom) – Richard Summers Jul 26 '19 at 20:45
  • Maybe `mask | (a > 0.05)` ? There is no [mcve], so I cannot test anything here. – ImportanceOfBeingErnest Jul 26 '19 at 20:55
  • BOOM! aaaaaah you legend! mask | (a > 0.05) is it... Thanks again. – Richard Summers Jul 26 '19 at 23:50
  • Sorry to dig this up, however I was searching the same thing. **Just a note** for people searching a way to filter sns heatmap for significant correlation: This does **not filter for p-values** as of the statistical interpretation. Because `.corr()` only gives you the correlation coefficient but no p-value (from a stat test against zero) – Björn Apr 22 '20 at 21:35
  • I think there was a confusion of p-value and correlation coefficient r – Björn Apr 22 '20 at 23:48

1 Answer


For the sake of completeness, here is a solution that uses scipy.stats.pearsonr to build a matrix of p-values. From that matrix you create a boolean mask to pass to seaborn, optionally combined with a triangular selection such as np.tril so that the upper triangle of the correlation matrix is hidden as well.

def corr_sig(df):
    """Return a matrix of p-values for all pairwise Pearson correlations."""
    cols = list(df.columns)
    p_matrix = np.zeros(shape=(len(cols), len(cols)))  # diagonal stays 0
    for i, col in enumerate(cols):
        for j, col2 in enumerate(cols):
            if i == j:
                continue  # skip the correlation of a column with itself
            _, p = stats.pearsonr(df[col], df[col2])
            p_matrix[i, j] = p
    return p_matrix

p_values = corr_sig(df)
mask = np.invert(np.tril(p_values<0.05))
# note: seaborn hides every cell where the mask value is True
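
As a possible alternative to the explicit double loop, pandas' DataFrame.corr accepts a callable as its method argument, so the p-value matrix can be sketched as below. The helper name pearson_pvalue and the small demo frame are made up for illustration. One difference to keep in mind: pandas documents that the diagonal is set to 1 when a callable is used, so the diagonal cells count as non-significant here and end up masked.

```python
import numpy as np
import pandas as pd
from scipy import stats

def pearson_pvalue(x, y):
    """Two-sided p-value of the Pearson correlation between two 1-D arrays."""
    _, p = stats.pearsonr(x, y)
    return p

# Small hand-made frame (hypothetical data, for illustration only)
a = np.arange(20, dtype=float)
c = np.where(np.arange(20) % 2 == 0, 1.0, -1.0)  # alternating +1 / -1
demo = pd.DataFrame({"a": a, "b": 2 * a + c, "c": c})

# pandas calls pearson_pvalue for every pair of columns
p_values = demo.corr(method=pearson_pvalue)

# True = hide the cell; keep only significant cells in the lower triangle
sig_mask = np.invert(np.tril(p_values.values < 0.05))

print(p_values.round(3))
print(sig_mask)
```

The resulting sig_mask has the same shape as demo.corr() and can be passed to sns.heatmap as the mask argument in the same way as the mask built from corr_sig.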


Complete Process with Examples

First off create some sample data (3 correlated variables; 3 uncorrelated ones):

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from scipy import stats

# Simulate 3 correlated variables
num_samples = 100
mu = np.array([5.0, 0.0, 10.0])
# The desired covariance matrix.
r = np.array([
        [  3.40, -2.75, -2.00],
        [ -2.75,  5.50,  1.50],
        [ -2.00,  1.50,  1.25]
    ])
y = np.random.multivariate_normal(mu, r, size=num_samples)
df = pd.DataFrame(y)
df.columns = ["Correlated1","Correlated2","Correlated3"]

# Create two random variables 
for i in range(2):
    df.loc[:,f"Uncorrelated{i}"] = np.random.randint(-2000,2000,len(df))

# Also add a nearly invariant variable as a further uncorrelated column
df.loc[:,"Near Invariant"] = np.random.randint(-99,-95,num_samples)

Plotting function for convenience
Mainly for cosmetics of the heatmap.

def plot_cor_matrix(corr, mask=None):
    f, ax = plt.subplots(figsize=(11, 9))
    sns.heatmap(corr, ax=ax,
                mask=mask,
                # cosmetics
                annot=True, vmin=-1, vmax=1, center=0,
                cmap='coolwarm', linewidths=2, linecolor='black', cbar_kws={'orientation': 'horizontal'})

Correlation plot of the example data with all correlations
This shows how the correlation matrix of the example data looks before filtering for significant correlations (p-values < .05).

# Plotting without significance filtering
corr = df.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))  # boolean mask: hide upper triangle and diagonal
plot_cor_matrix(corr,mask)
plt.show()


Correlation plot of the example data with only significant correlations
Finally, plot only the correlations whose p-value is significant (p < .05).

# Plotting with significance filter
corr = df.corr()                            # get correlation
p_values = corr_sig(df)                     # get p-Value
mask = np.invert(np.tril(p_values<0.05))    # mask - only get significant corr
plot_cor_matrix(corr,mask)
plt.show()


Conclusion

In the first correlation matrix, some correlation coefficients (r) exceed .05, the threshold used for filtering in the comments on the question, but that does not imply that their p-values are significant. It is therefore important to distinguish the p-value from the correlation coefficient r.
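
To make the distinction concrete, here is a minimal sketch with two hand-built data sets (the values are hypothetical and chosen only for illustration). The first has few observations and a large r, the second has many observations and a small r.

```python
import numpy as np
from scipy import stats

# Case 1: only 5 observations, strong correlation
x_small = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_small = np.array([2.0, 1.0, 4.0, 3.0, 6.0])
r_small, p_small = stats.pearsonr(x_small, y_small)

# Case 2: 10,000 observations, weak correlation
n = 10_000
x_large = np.arange(n, dtype=float)
x_std = (x_large - x_large.mean()) / x_large.std()   # standardised x
noise = np.where(np.arange(n) % 2 == 0, 1.0, -1.0)   # deterministic +1 / -1 pattern
y_large = 0.1 * x_std + noise
r_large, p_large = stats.pearsonr(x_large, y_large)

print(f"n=5:     r={r_small:.2f}, p={p_small:.3f}")
print(f"n=10000: r={r_large:.2f}, p={p_large:.1e}")
```

In the first case r is large but the p-value stays above .05, while in the second case r is only about 0.1 and the p-value is far below .05. A threshold on r therefore cannot stand in for a threshold on p.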

I hope this answer will help others who are looking for a way to plot only significant correlations with sns.heatmap.

Björn
  • I'm getting a ValueError: Mask must have the same shape as data. – prof31 Aug 17 '22 at 15:32
  • Hi, I cannot replicate your error. The code still runs fine for me. You would probably need to open a new question. Running `corr.shape == mask.shape` should return `True`. In my example, both the correlation df and the mask should have shape (6, 6) – Björn Aug 18 '22 at 07:36
  • Hi, I get the error message: in corr_sig(df) 4 for col2 in df.drop(col,axis=1).columns: 5 _ , p = stats.pearsonr(df[col],df[col2]) ----> 6 p_matrix[df.columns.to_list().index(col),df.columns.to_list().index(col2)] = p 7 return p_matrix 8 AttributeError: 'Index' object has no attribute 'to_list' – Alan Oct 21 '22 at 16:26