2

I have a data set that I group by items (ex. "widgets") and then run a correlation. The result gives a matrix for each widget but since I have a large number of them it makes it very hard to read any exported CSVs.

Is there a simple way to convert the matrix into columns representing the "column vs column" values? This is further confounded by the fact that some of the columns that go into the correlation are definitely not correlated.

I tried the answer from Pandas Correlation Groupby but it keeps the columns as an additional index (at least I think that is the correct terminology as I am new to pandas).

Example Correlation (with made up numbers):

+--------------------------------------+
|  1         Name Color Material MTTF  |
+--------------------------------------+
| Name         1   0.2      0.4 0.01   |
| Color      0.2     1     0.03 0.08   |
| Material   0.4   0.3        1 0.75   |
| MTTF      0.01  0.08     0.75    1   |
+--------------------------------------+
|  2         Name Color Material MTTF  |
+--------------------------------------+
|  ...       Name Color Material MTTF  |
+--------------------------------------+

What I would want for CSV output would be like the following (note I've removed columns where the correlation doesn't make sense):

+-------------------------------------------+
|     Material vs Color    Material vs MTTF |
+-------------------------------------------+
| 1                0.03                0.75 |
| 2                0.15                0.80 |
| ...                                       |
+-------------------------------------------+
Community
  • 1
  • 1
user1601333
  • 151
  • 1
  • 10

1 Answers1

2

It is not clear which columns are relevant. However, you can flatten a correlation matrix as follows:

# Generate random data.
df = pd.DataFrame(np.random.randn(10, 3), columns=list('ABC'))

# Create correlation matrix
corr_matrix = df.corr()
>>> corr_matrix
          A         B         C
A  1.000000  0.052204 -0.442129
B  0.052204  1.000000  0.187938
C -0.442129  0.187938  1.000000

# Flatten correlation matrix.
flat_cm = corr_matrix.stack().reset_index()
flat_cm['A_vs_B'] = flat_cm.level_0 + '_' + flat_cm.level_1
flat_cm.columns = ['A', 'B', 'correlation', 'A_vs_B']    
flat_cm = flat_cm.loc[flat_cm.correlation < 1, ['A_vs_B', 'correlation']]
>>> flat_cm
  A_vs_B  correlation
1    A_B    -0.202226
2    A_C    -0.103136
3    B_A    -0.202226
5    B_C     0.198216
6    C_A    -0.103136
7    C_B     0.198216
Alexander
  • 105,104
  • 32
  • 201
  • 196
  • This gets me part of the way. My end goal is still to re-arrange it so that I get a renamed columns like "A vs B" and "B vs C". The following worked but it seems like a lot of slashing and putting back together so I'm not sure if there is a better way: ``` `col1 = corr_list[(corr_list["level_1"] == 'A') & (corr_list["level_2"] == 'B')].rename(columns={0: "A_vs_B"})` col1 = col1.filter(["Widget Num", "A_vs_B"]) `col2 = corr_list[(corr_list["level_1"] == 'B') & (corr_list["level_2"] == 'C')].rename(columns={0: "B_vs_C"})` `col2 = col2.filter(["Widget Num", "B_vs_C"])` `pd.merge(col1, col2 ...)` – user1601333 Dec 05 '15 at 01:36