19

Hi all, I have a correlation matrix of 21 industry sectors. I want to split these 21 sectors into 4 or 5 groups, so that sectors with similar behavior are grouped together.

Could someone shed some light on how to do this in Python? Thanks very much in advance!

Jasper C.

2 Answers

22

You might explore pandas' DataFrame.corr together with SciPy's hierarchical clustering module, scipy.cluster.hierarchy:

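import pandas as pd
import scipy.cluster.hierarchy as spc

# my_data is assumed to hold the raw observations, one column per sector
df = pd.DataFrame(my_data)
corr = df.corr().values

# Euclidean distances between the rows of the correlation matrix (condensed form)
pdist = spc.distance.pdist(corr)
linkage = spc.linkage(pdist, method='complete')
# Cut the tree at half the largest pairwise distance to get flat cluster labels
idx = spc.fcluster(linkage, 0.5 * pdist.max(), 'distance')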
Wes Doyle
  • Thanks, Wes. I got the correlation matrix in Python, stuck now on how to cluster them into 4 or 5 blocks based upon their correlations. – Jasper C. Oct 12 '18 at 22:04
  • Great, Wes. This is very helpful. I will start with this. Have a nice weekend! – Jasper C. Oct 12 '18 at 22:14
  • 4
    Here is a link to an example use of scipy and pandas that may be of interest: https://github.com/TheLoneNut/CorrelationMatrixClustering/blob/master/CorrelationMatrixClustering.ipynb – Wes Doyle Oct 12 '18 at 22:14
  • 5
    What do I do with `idx` once I've obtained it? – Rylan Schaeffer Feb 27 '20 at 16:59
  • 3
    Is this right? Surely if a correlation is 0, then the pairwise distance is 0, which is the opposite of what we want? – cjm2671 Jun 06 '20 at 15:33
  • 1
    I don't understand the logic behind `pdist(corr)`. Shouldn't `1-corr` be the distance, not Euclidean distance between two rows? – jf328 Nov 25 '20 at 03:04
  • 1
    Can you explain the 0.5 * pdist.max() please? – till Kadabra Feb 17 '22 at 14:55
  • @jf328 is right, I believe: pdist computes pairwise distances between "observations". However, all observations have already been summarized into the correlation matrix (which is exactly pairwise in nature). – Martijn Courteaux Apr 17 '23 at 17:32
  • Regarding comment "I don't understand the logic behind pdist(corr). Shouldn't 1-corr be the distance, not Euclidean distance between two rows?": pdist(corr) and pdist(1-corr) are the same thing. Negating a matrix doesn't change the pairwise distances, nor does adding a constant offset. – dslack Jun 26 '23 at 04:41
  • 1
    I'd like to repeat @tillKadabra's question: why 0.5 * pdist.max()? Why not 0.4 or 0.8? (Generally it's helpful to avoid "magic numbers" and instead use named variables and comments to make the rationales clear.) Would love to hear your thoughts on this. – dslack Jun 26 '23 at 04:43
2

Okay, @Wes' answer suggests good functions for the task, but it uses them incorrectly. After some more reading of the documentation, it seems that spc.linkage expects a condensed pairwise distance matrix, that is, the upper-triangular part of the distance matrix, flattened row by row.

The documentation also says that pdist (reachable as spc.distance.pdist) returns a distance matrix in that condensed form. However, its input is NOT a correlation matrix or anything like it. It expects raw observations and computes the pairwise distances itself, using the specified metric.
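To make the condensed form concrete, here is a minimal sketch on made-up data. The `observations` array is purely hypothetical; `pdist` and `squareform` come from `scipy.spatial.distance`.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical toy data: 4 observations (rows) with 2 features each.
observations = np.array([[0.0, 0.0],
                         [3.0, 4.0],
                         [6.0, 8.0],
                         [0.0, 1.0]])

# pdist takes raw observations and returns the condensed distance vector:
# the upper triangle of the square matrix, row by row, n*(n-1)/2 entries.
condensed = pdist(observations, metric='euclidean')

# squareform converts between the condensed vector and the square matrix.
square = squareform(condensed)

# The same condensed vector, extracted by hand from the square matrix.
by_hand = np.concatenate([row[i + 1:] for i, row in enumerate(square)])
```

For 4 observations the condensed vector has 6 entries, and `by_hand` matches `condensed`, so either route should give `spc.linkage` the input it expects.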

Now, it will come as no surprise that a covariance or correlation matrix already summarizes the observations into a pairwise matrix. Instead of distance, though, it represents correlation. I am unsure what the most mathematically sound approach is here, but I believe you can turn the correlation matrix into a distance matrix of sorts by simply calculating 1.0 - corr.

So let's do that:

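import numpy as np
# corr is the correlation matrix as a NumPy array (df.corr().values)
pdist_uncondensed = 1.0 - corr
# Condensed form: the entries above the diagonal, row by row
pdist_condensed = np.concatenate([row[i+1:] for i, row in enumerate(pdist_uncondensed)])
linkage = spc.linkage(pdist_condensed, method='complete')
idx = spc.fcluster(linkage, 0.5 * pdist_condensed.max(), 'distance')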
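Since the question asks for a fixed number of groups, and the comments ask what to do with idx, here is a hedged end-to-end sketch. It is one possible variant, not the only sound one: it swaps the distance threshold for fcluster's 'maxclust' criterion, which takes the desired number of clusters directly, and it maps the labels in idx back to sector names. The data, the sector names 'A' to 'F', and the variables `returns`, `n_groups` and `groups` are all made up for illustration. Using 1 - corr as the dissimilarity is the same assumption as above.

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as spc
from scipy.spatial.distance import squareform

# Hypothetical data: 200 observations of 6 sectors driven by 2 hidden factors,
# so sectors A-C should move together and sectors D-F should move together.
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 2))
noise = 0.3 * rng.normal(size=(200, 6))
returns = pd.DataFrame(
    factors[:, [0, 0, 0, 1, 1, 1]] + noise,
    columns=['A', 'B', 'C', 'D', 'E', 'F'],
)

corr = returns.corr()

# Assumption: use 1 - corr as the dissimilarity between sectors.
dist = 1.0 - corr.values
# Force exact symmetry and a zero diagonal so squareform's checks pass.
dist = (dist + dist.T) / 2.0
np.fill_diagonal(dist, 0.0)
condensed = squareform(dist)

link = spc.linkage(condensed, method='complete')

# Ask for at most n_groups flat clusters instead of picking a distance cutoff.
n_groups = 2
idx = spc.fcluster(link, t=n_groups, criterion='maxclust')

# idx[i] is the (1-based) cluster label of the i-th sector; group names by label.
groups = {}
for name, label in zip(corr.columns, idx):
    groups.setdefault(int(label), []).append(name)
```

With the 21 sectors from the question, `n_groups` would be 4 or 5, and `groups` would then list the sector names that fall into each cluster.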
Martijn Courteaux