Create clusters using correlation matrix in Python

Question

Hi all, I have a correlation matrix of 21 industry sectors. I want to split these 21 sectors into 4 or 5 groups, so that sectors that behave similarly end up in the same group.

Could someone shed some light on how to do this in Python? Thanks in advance!

sklearn has plenty of [clustering algorithms](http://scikit-learn.org/stable/modules/clustering.html); this forum is aimed more at specific coding problems than at general "How do I" questions. – G. Anderson Oct 12 '18 at 21:59
Thanks very much, seralouk and Hielka. Could either of you give me a simple example of how to get started, please? I'm not good enough at Python yet. – Jasper C. Oct 12 '18 at 22:00

Wes Doyle · Accepted Answer · 2018-10-12T22:53:40.600

22

You might explore the use of pandas DataFrame.corr and the hierarchical clustering module scipy.cluster.hierarchy:

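import pandas as pd
import scipy.cluster.hierarchy as spc

# my_data: your observations, one column per sector
df = pd.DataFrame(my_data)
corr = df.corr().values

# condensed pairwise Euclidean distances between the rows of corr
pdist = spc.distance.pdist(corr)
linkage = spc.linkage(pdist, method='complete')
# cut the tree at half the largest distance; idx holds one cluster label per sector
idx = spc.fcluster(linkage, 0.5 * pdist.max(), 'distance')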
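As a rough sketch of what can be done with idx afterwards: fcluster returns one integer label per row of corr, in the same order as the columns of df, so the labels can be zipped back onto the column names. The data and the sector names A to F below are made up for illustration, and the 'maxclust' criterion is used here (instead of the distance cutoff above) as one way to ask for a fixed number of groups.

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as spc
from scipy.spatial.distance import pdist

# Made-up data: 200 observations of 6 hypothetical sectors.
# Columns A-C follow one hidden factor, D-F follow another.
rng = np.random.default_rng(0)
factors = rng.normal(size=(200, 2))
noise = 0.3 * rng.normal(size=(200, 6))
data = np.repeat(factors, 3, axis=1) + noise
df = pd.DataFrame(data, columns=["A", "B", "C", "D", "E", "F"])

corr = df.corr().values
dist = pdist(corr)  # same step as above: distances between rows of corr
Z = spc.linkage(dist, method="complete")

# 'maxclust' asks for at most t clusters rather than a distance cutoff.
idx = spc.fcluster(Z, t=2, criterion="maxclust")

# idx[i] is the cluster label of df.columns[i]; collect names per label.
groups = {}
for name, label in zip(df.columns, idx):
    groups.setdefault(int(label), []).append(name)

print(groups)
```

For the 21 sectors in the question, t=4 or t=5 would be the analogous setting; which label number each group receives is arbitrary.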

edited Oct 12 '18 at 22:53

answered Oct 12 '18 at 22:01

Wes Doyle

Thanks, Wes. I got the correlation matrix in Python, stuck now on how to cluster them into 4 or 5 blocks based upon their correlations. – Jasper C. Oct 12 '18 at 22:04
Great, Wes. This is very helpful. I will start with this. Have a nice weekend! – Jasper C. Oct 12 '18 at 22:14
Here is a link to an example use of scipy and pandas that may be of interest: https://github.com/TheLoneNut/CorrelationMatrixClustering/blob/master/CorrelationMatrixClustering.ipynb – Wes Doyle Oct 12 '18 at 22:14
What do I do with `idx` once I've obtained it? – Rylan Schaeffer Feb 27 '20 at 16:59
Is this right? Surely if a correlation is 0, then the pairwise distance is 0, which is the opposite of what we want? – cjm2671 Jun 06 '20 at 15:33
I don't understand the logic behind `pdist(corr)`. Shouldn't `1-corr` be the distance, not Euclidean distance between two rows? – jf328 Nov 25 '20 at 03:04
Can you explain the 0.5 * pdist.max() please? – till Kadabra Feb 17 '22 at 14:55
@jf328 is right, I believe: pdist computes pairwise distances between "observations". However, all observations have already been summarized into the correlation matrix (which is exactly pairwise in nature). – Martijn Courteaux Apr 17 '23 at 17:32
Regarding comment "I don't understand the logic behind pdist(corr). Shouldn't 1-corr be the distance, not Euclidean distance between two rows?": pdist(corr) and pdist(1-corr) are the same thing. Negating a matrix doesn't change the pairwise distances, nor does adding a constant offset. – dslack Jun 26 '23 at 04:41
I'd like to repeat @tillKadabra's question: why 0.5 * pdist.max()? Why not 0.4 or 0.8? (Generally it's helpful to avoid "magic numbers" and instead use named variables and comments to make the rationales clear.) Would love to hear your thoughts on this. – dslack Jun 26 '23 at 04:43

score 2 · Answer 2 · answered Apr 18 '23 at 08:20

Okay, @Wes' answer suggests good functions for the task, but it uses them incorrectly. After some more reading of the documentation, it seems that spc.linkage needs a condensed pairwise distance matrix, that is, the strictly upper-triangular part of the distance matrix, flattened row by row.

The documentation also says that the pdist function (scipy.spatial.distance.pdist, which the code above reaches as spc.distance.pdist) returns a distance matrix in that condensed form. However, its input is NOT a correlation matrix or anything like it. It needs the raw observations and computes the pairwise distances itself using the specified metric.
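To make the condensed form concrete, here is a small sketch with three made-up points (the points themselves are only an illustration). pdist takes the observations and returns the upper triangle row by row, and scipy.spatial.distance.squareform converts between the condensed vector and the full square matrix.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three made-up observations (rows) in two dimensions.
points = np.array([[0.0, 0.0],
                   [3.0, 4.0],
                   [6.0, 8.0]])

# Condensed form: distances for the pairs (0,1), (0,2), (1,2), in that order.
condensed = pdist(points, metric="euclidean")

# Full symmetric 3x3 matrix with a zero diagonal.
square = squareform(condensed)

print(condensed)  # [ 5. 10.  5.]
print(square)
```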

Now, it will come as no surprise that a covariance or correlation matrix already summarizes the observations pairwise. Instead of representing distance, though, it represents correlation. Here I am unsure what is mathematically the soundest thing to do, but I believe you can turn the correlation matrix into a distance matrix of sorts by calculating 1.0 - corr.

So let's do that:

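import numpy as np

# corr and spc are defined as in the answer above
pdist_uncondensed = 1.0 - corr
# flatten the strict upper triangle row by row into condensed form
pdist_condensed = np.concatenate([row[i+1:] for i, row in enumerate(pdist_uncondensed)])
linkage = spc.linkage(pdist_condensed, method='complete')
idx = spc.fcluster(linkage, 0.5 * pdist_condensed.max(), 'distance')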
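For anyone who wants to run this end to end, below is a self-contained sketch on made-up data. The sector names, the synthetic returns and the MIN_CORR constant are my assumptions, not part of the question. It uses scipy.spatial.distance.squareform to build the condensed vector, which should give the same result as the np.concatenate line above. It also tries to give the cutoff a meaning: with distance = 1 - corr and complete linkage, cutting the tree at t should leave every pair inside a cluster with correlation of at least 1 - t, so the threshold can be chosen as a minimum within-cluster correlation rather than as a fraction of the maximum distance.

```python
import numpy as np
import pandas as pd
import scipy.cluster.hierarchy as spc
from scipy.spatial.distance import squareform

# Made-up returns for 6 hypothetical sectors: two blocks of 3,
# each block driven by its own hidden factor plus noise.
rng = np.random.default_rng(42)
factors = rng.normal(size=(500, 2))
data = np.repeat(factors, 3, axis=1) + 0.5 * rng.normal(size=(500, 6))
df = pd.DataFrame(data, columns=[f"sector_{i}" for i in range(6)])

corr = df.corr().values
dist_square = 1.0 - corr            # 0 for corr = 1, 1 for corr = 0, 2 for corr = -1
np.fill_diagonal(dist_square, 0.0)  # remove floating-point noise on the diagonal
dist_condensed = squareform(dist_square, checks=False)

Z = spc.linkage(dist_condensed, method="complete")

# Assumed requirement: every pair within a cluster has corr >= MIN_CORR.
MIN_CORR = 0.5
labels = spc.fcluster(Z, t=1.0 - MIN_CORR, criterion="distance")

for name, label in zip(df.columns, labels):
    print(name, label)
```

If a fixed number of groups is wanted instead (4 or 5 in the question), criterion="maxclust" with t set to that number is the alternative to a distance cutoff.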
