I am trying to find a way to get the Pearson correlation and p-value between two columns in a dataframe, computed separately for each group defined by a third column.
df =
| BucketID | Intensity | BW25113 |
|----------|-----------|---------|
| 825.326  | 3459870   | 0.5     |
| 825.326  | 8923429   | 0.95    |
| 734.321  | 12124     | 0.4     |
| 734.321  | 2387499   | 0.3     |
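For reference, this example frame can be rebuilt with:

```python
import pandas as pd

# Reconstruct the example dataframe shown above
df = pd.DataFrame({
    'BucketID':  [825.326, 825.326, 734.321, 734.321],
    'Intensity': [3459870, 8923429, 12124, 2387499],
    'BW25113':   [0.5, 0.95, 0.4, 0.3],
})
```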
I originally tried something with pd.Series.corr(), which is very fast and produces the output I want:
import pandas as pd

bio1 = df.columns[1:].tolist()                      # ['Intensity', 'BW25113', ...]
pcorrs2 = [s + '_Corr' for s in bio1]
coldict2 = dict(zip(bio1, pcorrs2))                 # map each column to its '<col>_Corr' name
coldict2

df2 = (df.groupby('BucketID')[bio1]
         .corr(method='pearson')
         .unstack()['Intensity']                    # correlation of every column with Intensity
         .reset_index()
         .rename(columns=coldict2))
df3 = pd.melt(df2, id_vars='BucketID', var_name='Org', value_name='correlation')
df3['Org'] = df3['Org'].str.replace('_Corr', '', regex=False)   # drop the '_Corr' suffix
df3
This then gives me the (mostly) desired table:
| BucketID | Org       | correlation |
|----------|-----------|-------------|
| 734.321  | Intensity | 1.0         |
| 825.326  | Intensity | 1.0         |
| 734.321  | BW25113   | -1.0        |
| 825.326  | BW25113   | 1.0         |
This gives me the Pearson correlations, but not the p-values, which would be helpful for judging how meaningful the correlations are.
Is there a way to get the p-value associated with pd.Series.corr() in this approach, or would some version using scipy.stats.pearsonr that iterates over the dataframe for each BucketID be more efficient? I tried something of that flavor, but it has been incredibly slow (tens of minutes instead of a few seconds).
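For illustration, the kind of per-BucketID scipy.stats.pearsonr loop I have in mind is roughly this sketch (not my exact code; long_df, result, and the melt step are just one way to set it up):

```python
import pandas as pd
from scipy import stats

# Long format: one row per (BucketID, Org) measurement, keeping Intensity alongside
long_df = df.melt(id_vars=['BucketID', 'Intensity'], var_name='Org')

rows = []
for (bucket, org), grp in long_df.groupby(['BucketID', 'Org']):
    # Pearson r and its two-sided p-value within each BucketID/Org group
    r, p = stats.pearsonr(grp['Intensity'], grp['value'])
    rows.append({'BucketID': bucket, 'Org': org, 'correlation': r, 'p_value': p})

result = pd.DataFrame(rows)
```

This produces the correlation and p_value columns I want, but the Python-level loop over groups is presumably what makes my real version so slow.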
Thanks in advance for the assistance and/or comments.