
I'm using the fastcluster package for Python to compute the linkage matrix for a hierarchical clustering procedure over a large set of observations.

So far so good: fastcluster's linkage_vector() method can cluster a much larger set of observations than scipy.cluster.hierarchy.linkage() using the same amount of memory.

With this done, I now want to inspect the clustering results and compute the cophenetic correlation coefficient with respect to the original data. The usual procedure is to first compute the condensed cophenetic distance matrix and then correlate it with the original pairwise distances. Using scipy's cophenet() method, it would look something like this:

import fastcluster as fc
import numpy as np
from scipy.cluster.hierarchy import cophenet

X = np.random.random((1000,10))  # Original data (1000 observations)
Z = fc.linkage_vector(X)         # Clustering

orign_dists = fc.pdist(X)  # Condensed matrix of original pairwise distances
cophe_dists = cophenet(Z)  # Condensed matrix of cophenetic distances between observations

# What I really want at the end of the day is
corr_coef = np.corrcoef(orign_dists, cophe_dists)[0,1]
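
Incidentally, scipy's cophenet() can also return the coefficient directly when handed the original condensed distances, collapsing the last two steps into one; but as far as I can tell it still allocates the full condensed cophenetic matrix internally, so it doesn't change the memory picture:

corr_coef, cophe_dists = cophenet(Z, orign_dists)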

However, this doesn't work when the set of observations is very large (just replace 1000 with 100000 or so and you'll see). Fastcluster's algorithm has no problem with the clustering, but both pdist() and cophenet() raise a MemoryError when allocating their condensed distance matrices.
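
The arithmetic makes the problem obvious: a condensed distance matrix holds n*(n-1)/2 doubles, and this procedure needs two of them:

# Memory footprint of ONE condensed float64 distance matrix
n = 100000
print(n * (n - 1) // 2 * 8 / 1e9)  # ~40 GB; pdist() and cophenet() each allocate one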

For these cases where the observation set is too big to be handled by the standard scipy functions, I don't know of an alternative way of computing the cophenetic correlation offered by fastcluster or any other package out there. Do you? If so, how? If not, can you think of a clever, memory-efficient, iterative way of achieving this with a customized function? I'm polling for ideas here, maybe even the solution.
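
To make the question concrete, here is an untested sketch of the kind of iterative approach I'm imagining (the function name cophenetic_corr_streaming and the chunk parameter are made up by me, not any library's API). It relies on the fact that when clusters A and B merge at height h in the linkage, every cross pair (a in A, b in B) gets cophenetic distance exactly h, so the sums needed for Pearson's r can be accumulated merge by merge without ever materializing either condensed matrix:

import numpy as np
from scipy.spatial.distance import cdist

def cophenetic_corr_streaming(X, Z, chunk=2048):
    """Cophenetic correlation without materializing condensed matrices."""
    n = X.shape[0]
    members = {i: [i] for i in range(n)}   # live cluster id -> member indices
    S_d = S_d2 = S_c = S_c2 = S_cd = 0.0   # running sums over all pairs
    N = n * (n - 1) // 2                   # total number of pairs
    for k in range(Z.shape[0]):
        a, b, h = int(Z[k, 0]), int(Z[k, 1]), Z[k, 2]
        A, B = members.pop(a), members.pop(b)
        m = len(A) * len(B)                # new cross pairs created by this merge
        S_c += h * m
        S_c2 += h * h * m
        # Original distances over the cross pairs, in memory-bounded blocks
        for i in range(0, len(A), chunk):
            XA = X[A[i:i + chunk]]
            for j in range(0, len(B), chunk):
                d = cdist(XA, X[B[j:j + chunk]])  # Euclidean, like pdist's default
                s = d.sum()
                S_d += s
                S_d2 += (d * d).sum()
                S_cd += h * s
        members[n + k] = A + B             # merged cluster gets id n + k
    num = N * S_cd - S_c * S_d
    den = np.sqrt((N * S_c2 - S_c**2) * (N * S_d2 - S_d**2))
    return num / den

This is necessarily O(n^2) in time, since there are n*(n-1)/2 pairs to visit, but it should only need O(n + chunk^2) memory. Is something like this (presumably better engineered, e.g. with a proper union-find instead of list concatenation) already out there, or is there a flaw in the idea?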

PDRX
  • You say it "doesn't work." Do you mean it takes too long, or it crashes, or it gives wrong results? Would you please look at the source code (https://github.com/scipy/scipy/blob/v0.16.1/scipy/cluster/hierarchy.py#L935) and execute each line of `cophenet()` individually and tell us which line causes the problem? Or if it's running out of memory or crashing, post the full traceback. – John Zwinck May 10 '17 at 02:24
  • It's a MemoryError. The problem lies in here: `dm = np.zeros((m * (m - 1)) // 2, dtype=np.double)`. – PDRX May 10 '17 at 02:33
  • So with Z.shape[0] being 100000, that's 5 billion doubles, which is 40 GB of data. How much RAM do you have? It would be worth trying this on a machine with 128 GB or so and see if it works, and if so, how long it takes. What is the shape of orign_dists in the 100000 case? – John Zwinck May 10 '17 at 02:52
  • Well, actually I just realised the error is raised before arriving at `cophenet()`. I think `fc.pdist()` is calling `scipy.spatial.distance.pdist()` and the MemoryError happens at line 1377, in `dm = np.zeros((m*(m-1))//2,dtype=np.double)`. The same thing happens in `cophenet()` if I swap the order of computing orign_dists and cophe_dists, this time at line 1154 of `scipy.cluster.hierarchy`, in `zz = np.zeros((n * (n-1)) // 2, dtype=np.double)`. The machine I'm using has 200+ GB of RAM and it's still crashing, and I'd like this to scale to Z.shape[0] ~ 1e7... – PDRX May 10 '17 at 03:33
  • OK, so try running it with the largest input that works. What then is the shape of orign_dists, cophe_dists, Z, and X? How much memory does it use? – John Zwinck May 10 '17 at 04:04
  • The *MemoryError* I showed before was raised while running the code on my PC. Running it on the cluster I can go up to X.shape=(45000,10), X.nbytes=3600000, which is Z.shape=(44999,4), Z.nbytes=1439968. The (shape, nbytes) for orign_dists and cophe_dists are the same: ((1012477500,), 8099820000). It completes, but throws *VisibleDeprecationWarning: using a non-integer number instead of an integer will result in an error in the future*. When trying X.shape=(50000,10) it crashes with a *Segmentation fault*. This is Python 2.7. – PDRX May 10 '17 at 13:15

0 Answers