
I am trying to find a way to get the Pearson correlation and p-value between two columns in a DataFrame when a third column meets certain conditions.

df =

BucketID   Intensity   BW25113
825.326    3459870     0.5
825.326    8923429     0.95
734.321    12124       0.4
734.321    2387499     0.3

I originally tried something with the pd.Series.corr() function, which is very fast and does what I want to get my final outputs:

bio1 = df.columns[1:].tolist()
pcorrs2 = [s + '_Corr' for s in bio1]
coldict2 = dict(zip(bio1, pcorrs2))


df2 = df.groupby('BucketID')[bio1].corr(method='pearson').unstack()['Intensity'].reset_index().rename(columns=coldict2)
df3 = pd.melt(df2, id_vars='BucketID', var_name='Org', value_name='correlation')
# note: rstrip('_Corr') would strip characters, not the suffix
df3['Org'] = df3['Org'].str.replace('_Corr$', '', regex=True)
df3

This then gives me the (mostly) desired table:

BucketID   Org         correlation
734.321    Intensity   1.0
825.326    Intensity   1.0
734.321    BW25113     -1.0
825.326    BW25113     1.0

This works for giving me the Pearson correlations but not the p-values, which would be helpful for determining the relevance of the correlations.

Is there a way to get the p-value associated with pd.Series.corr() in this way or would some version with scipy.stats.pearsonr that iterates over the dataframe for each BucketID be more efficient? I tried something of this flavor, but it has been incredibly slow (tens of minutes instead of a few seconds).
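For context, pearsonr's two-sided p-value is equivalent to a t-test on r with n − 2 degrees of freedom, so in principle it can be computed from the fast groupby correlations without calling pearsonr once per group. A sketch with made-up group data (the group labels and sizes here are illustrative, not from my real DataFrame), which only works when each group has more than two rows and |r| < 1:

```python
import numpy as np
import pandas as pd
from scipy import stats

df = pd.DataFrame({'BucketID': ['A', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B'],
                   'Intensity': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'BW25113': [1, 2, 6, 4, 5, 7, 7, 8, 7, 12]})

# Pearson r per group via the fast pandas path
r = df.groupby('BucketID').apply(lambda g: g['Intensity'].corr(g['BW25113']))
n = df.groupby('BucketID').size()

# two-sided p-value: t = r * sqrt(n - 2) / sqrt(1 - r^2), p = 2 * P(T > |t|)
t = r * np.sqrt(n - 2) / np.sqrt(1 - r ** 2)
p = 2 * stats.t.sf(np.abs(t), n - 2)
```

Note that with only two rows per group (as in the sample table above), r is always ±1 and the p-value is undefined, so this only helps for groups with three or more observations.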

Thanks in advance for the assistance and/or comments.


1 Answer


You can use scipy.stats.pearsonr on a dataframe as follows:

import pandas as pd
import scipy.stats

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'col2': [1, 2, 6, 4, 5, 7, 7, 8, 7, 12]})

scipy.stats.pearsonr(df['col1'], df['col2'])

This returns a tuple: the first value is the correlation coefficient and the second is the p-value.

(0.9049484650760702, 0.00031797789083818853)
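As a side note (assuming a recent SciPy, 1.9 or later), the return value is a PearsonRResult object: it still unpacks like a tuple, but also carries named attributes:

```python
import pandas as pd
import scipy.stats

df = pd.DataFrame({'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'col2': [1, 2, 6, 4, 5, 7, 7, 8, 7, 12]})

result = scipy.stats.pearsonr(df['col1'], df['col2'])
correlation, p_value = result   # tuple-style unpacking still works
# in SciPy >= 1.9, result.statistic and result.pvalue are the named equivalents
```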

Update

For doing this per group programmatically, you can groupby() and then loop through the groups:

df = pd.DataFrame({'group': ['A', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B'],
                   'col1': [1,2,3,4,5,6,7,8,9,10],
                   'col2': [1,2,6,4,5,7,7,8,7,12]})
for group_name, group_data in df.groupby('group'):
    print(group_name, scipy.stats.pearsonr(group_data['col1'], group_data['col2']))

Results in...

A (0.9817469600192116, 0.0029521879612042588)
B (0.8648495371134326, 0.05841898744667266)

These can also be stored in a new results DataFrame:

# DataFrame.append was removed in pandas 2.0, so collect rows and build the frame once
rows = []
for group_name, group_data in df.groupby('group'):
    correlation, p_value = scipy.stats.pearsonr(group_data['col1'], group_data['col2'])
    rows.append({'group': group_name, 'corr': correlation, 'p_value': p_value})

results = pd.DataFrame(rows)
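A more compact alternative (a sketch of the same idea, with `corr_with_p` as an illustrative helper name): let groupby().apply() build the three columns — group, corr, p_value — in one pass:

```python
import pandas as pd
import scipy.stats

df = pd.DataFrame({'group': ['A', 'A', 'B', 'A', 'A', 'B', 'B', 'B', 'A', 'B'],
                   'col1': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                   'col2': [1, 2, 6, 4, 5, 7, 7, 8, 7, 12]})

def corr_with_p(g):
    # returning a Series turns each group's result into one row
    corr, p = scipy.stats.pearsonr(g['col1'], g['col2'])
    return pd.Series({'corr': corr, 'p_value': p})

results = df.groupby('group').apply(corr_with_p).reset_index()
# results has columns: group, corr, p_value
```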
  • Thanks mullinscr for the answer. I tried that but it doesn't solve my issue of performing that continuously for each BucketID. I tried this in a for loop and it works, but it takes a bit of time depending on the size of the dataframes (tens of thousands of rows). Is there a better way to do that when I have repeat values like BucketID? – jhaeckl Feb 03 '21 at 15:12
  • Can I then save the results of this into three separate columns (group_name, correlation, p_value)? – jhaeckl Feb 04 '21 at 10:37