
I have a correlation coefficient matrix (n*n). How can I do clustering using this correlation coefficient matrix?

Can I use the linkage and fcluster functions in SciPy?

The linkage function needs an n*m matrix (according to the tutorial), but I want to use an n*n matrix.

My code is:

corre = mp_N.corr()    # mp_N is raw data (m*n matrix)  
Z = linkage(corre, method='average')  # 'corre' is correlation coefficient matrix
fcluster(Z,2,'distance')

Is this code right? If this code is wrong, how can I do clustering with correlation coefficient matrix?

Siny
  • Without example data, expected results and returned results, no one can tell if your code is right. Please create a [Minimal, Complete, and Verifiable example](http://stackoverflow.com/help/mcve). Additionally, you may find some more clustering libraries and examples in the [scikit-learn](http://scikit-learn.org/stable/) package. – tmthydvnprt Jul 23 '16 at 14:56

1 Answer


Clustering data using a correlation matrix is a reasonable idea, but the correlations have to be pre-processed first. First, the correlation matrix, as returned by numpy.corrcoef, is affected by floating-point rounding errors:

  1. It is not always exactly symmetric.
  2. The diagonal terms are not always exactly 1.

Both can be fixed by averaging the matrix with its transpose and filling the diagonal with 1:

import numpy as np
data = np.random.randint(0, 10, size=(20, 10))   # 20 variables with 10 observations each
corr = np.corrcoef(data)                         # 20 by 20 correlation matrix
corr = (corr + corr.T)/2                         # made symmetric
np.fill_diagonal(corr, 1)                        # put 1 on the diagonal

Second, the input to any clustering method, such as linkage, needs to measure the dissimilarity of objects, whereas correlation measures similarity. So the correlations need to be transformed so that a correlation of 1 is mapped to 0, while a correlation of 0 is mapped to a large number.

This blog post discusses several ways of such data transformation, and recommends dissimilarity = 1 - abs(correlation). The idea is that strong negative correlation is also an indication that the objects are related, just as positive correlation is. Here is the continuation of the example:

from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

dissimilarity = 1 - np.abs(corr)
hierarchy = linkage(squareform(dissimilarity), method='average')
labels = fcluster(hierarchy, 0.5, criterion='distance')

Note that we don't feed the full square distance matrix into linkage; it has to be compressed to condensed form with squareform first. Otherwise linkage would interpret each row of the square matrix as an observation vector rather than as precomputed distances.
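For illustration (a minimal sketch with made-up numbers, not part of the pipeline above): squareform converts between the square matrix form and the condensed vector form, keeping only the upper-triangular entries.

```python
import numpy as np
from scipy.spatial.distance import squareform

# A small symmetric dissimilarity matrix with zeros on the diagonal
d = np.array([[0.0, 0.2, 0.9],
              [0.2, 0.0, 0.4],
              [0.9, 0.4, 0.0]])

condensed = squareform(d)     # upper-triangular entries, row by row
print(condensed)              # a vector of length n*(n-1)/2

restored = squareform(condensed)  # converts back to the square form
print(np.allclose(restored, d))
```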

Which clustering method and which threshold to use depends on the context of your problem; there are no universal rules. Often 0.5 is a reasonable threshold for correlation, so that is what I used. With my 20 sets of random numbers I ended up with 7 clusters, encoded in labels as

[7, 7, 7, 1, 4, 4, 2, 7, 5, 7, 2, 5, 6, 3, 6, 1, 5, 1, 4, 2] 
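If you want the members of each cluster rather than just the label array, a small follow-up sketch (not part of the original answer) groups the variable indices by their fcluster label:

```python
from collections import defaultdict

# 'labels' as returned by fcluster: labels[i] is the cluster id of variable i
labels = [7, 7, 7, 1, 4, 4, 2, 7, 5, 7, 2, 5, 6, 3, 6, 1, 5, 1, 4, 2]

clusters = defaultdict(list)
for idx, lab in enumerate(labels):
    clusters[lab].append(idx)

for lab in sorted(clusters):
    print(lab, clusters[lab])   # e.g. cluster 7 contains variables 0, 1, 2, 7, 9
```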
  • Nice answer! Your mentioning of distance (dissimilarity) and correlation (similarity) is essential to me. – cgsdfc Jun 30 '19 at 00:54
  • 1
    `squareform()` is the key to convert between the dense matrix form and condensed vector form of correlation. And in terms of making correlation a _distance_, let me add that scipy uses `1 - corr`, which has a different rationale than `1 - abs(corr)`. I am not sure which one to use so I decided to follow the way of scipy. – cgsdfc Jun 30 '19 at 00:59
  • 1
    Your link to the blog post is outdated. – cgsdfc Jun 30 '19 at 01:00
  • I'd recommend `squareform(corr, checks=False, force='tovector')` to not check the diagonal elements (since they are discarded) and force the direction of conversion. – cgsdfc Jun 30 '19 at 01:02
  • 2
    What does one do with `labels` to then order and plot the correlation matrix? – Rylan Schaeffer Feb 27 '20 at 17:03
  • @RylanSchaeffer you could try `matrix[labels, :][:, labels]` assuming your labels are sorted – K. W. Cooper Dec 03 '21 at 10:05
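Following up on the last two comments, one common way (a sketch on a made-up matrix, not from the answer itself) to reorder the correlation matrix so that members of the same cluster sit next to each other is to sort the indices by cluster label and permute rows and columns together:

```python
import numpy as np

# Hypothetical small example: a symmetric correlation matrix and cluster labels
corr = np.array([[1.0, 0.1, 0.8],
                 [0.1, 1.0, 0.2],
                 [0.8, 0.2, 1.0]])
labels = np.array([1, 2, 1])              # e.g. as returned by fcluster

order = np.argsort(labels, kind='stable')  # indices grouped by cluster id
reordered = corr[np.ix_(order, order)]     # permute rows and columns together
print(reordered)                           # could then be shown with plt.imshow
```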