I'm using the fastcluster package for Python to compute the linkage matrix for a hierarchical clustering procedure over a large set of observations.
So far so good: fastcluster's linkage_vector() function lets me cluster a much larger set of observations than scipy's linkage() could handle with the same amount of memory.
With that done, I now want to inspect the clustering results and compute the cophenetic correlation coefficient with respect to the original data. The usual procedure is to compute the matrix of cophenetic distances and then check its correlation with the original pairwise distances. Using scipy's cophenet() function, it would look something like this:
import fastcluster as fc
import numpy as np
from scipy.cluster.hierarchy import cophenet
from scipy.spatial.distance import pdist
X = np.random.random((1000,10)) # Original data (1000 observations)
Z = fc.linkage_vector(X) # Clustering
orig_dists = pdist(X)      # Condensed matrix of original pairwise distances
cophe_dists = cophenet(Z)  # Condensed matrix of cophenetic distances
# What I really want at the end of the day is
corr_coef = np.corrcoef(orig_dists, cophe_dists)[0, 1]
However, this doesn't work when the set of observations is very large (just replace 1000 with 100000 or so and you'll see). Fastcluster has no problem with the clustering step, but scipy's cophenet() runs out of memory when it turns the resulting linkage matrix into the condensed matrix of cophenetic distances.
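For scale, here is my own back-of-the-envelope estimate of why this blows up; it is just the size of a condensed distance vector in float64, nothing specific to scipy or fastcluster:

n = 100000
n_pairs = n * (n - 1) // 2   # ~5e9 pairwise distances
print(n_pairs * 8 / 1e9)     # ~40 GB per condensed float64 vector,
                             # and both pdist's and cophenet's outputs are that size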
For these cases where the set of observations is too big to be handled by the standard scipy function, I don't know of an alternative way of computing the cophenetic correlation in fastcluster or any other package out there. Do you? If so, how? If not, can you think of a clever, memory-efficient, iterative way of achieving this with a custom function? I'm polling for ideas here, maybe even the solution.
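To make concrete what I mean by "iterative", here is a rough sketch of the kind of custom function I'm imagining. Everything below is my own illustration, not an existing fastcluster or scipy API: the name cophenetic_corr_lowmem is made up, the row-by-row leaf-to-root walk over the linkage matrix is a stand-in for the condensed matrix that cophenet() materialises all at once, and a pure-Python inner loop like this would be far too slow at n = 100000 without numba or Cython, so it only addresses the memory side of the problem.

import numpy as np
from scipy.spatial.distance import cdist

def cophenetic_corr_lowmem(Z, X):
    """Streaming Pearson correlation between original and cophenetic
    distances, processed one observation row at a time (O(n) extra memory)."""
    n = X.shape[0]
    # parent[c] = cluster formed when cluster c is merged; height[c] = merge distance
    parent = np.full(2 * n - 1, -1, dtype=np.int64)
    height = np.zeros(2 * n - 1)
    for k in range(n - 1):
        a, b = int(Z[k, 0]), int(Z[k, 1])
        parent[a] = parent[b] = n + k
        height[n + k] = Z[k, 2]
    # Running sums for the Pearson correlation over all n*(n-1)/2 pairs
    s_x = s_y = s_xx = s_yy = s_xy = 0.0
    m = 0
    for i in range(n - 1):
        # Heights of all ancestors of leaf i, keyed by cluster id
        anc = {}
        c = i
        while c != -1:
            anc[c] = height[c]
            c = parent[c]
        # Cophenetic distance from i to each j > i is the height of the first
        # ancestor of j that is also an ancestor of i (their merge height)
        coph = np.empty(n - 1 - i)
        for idx, j in enumerate(range(i + 1, n)):
            c = j
            while c not in anc:
                c = parent[c]
            coph[idx] = anc[c]
        orig = cdist(X[i:i + 1], X[i + 1:]).ravel()  # original distances, one row
        s_x += orig.sum()
        s_y += coph.sum()
        s_xx += (orig * orig).sum()
        s_yy += (coph * coph).sum()
        s_xy += (orig * coph).sum()
        m += orig.size
    cov = s_xy / m - (s_x / m) * (s_y / m)
    var_x = s_xx / m - (s_x / m) ** 2
    var_y = s_yy / m - (s_y / m) ** 2
    return cov / np.sqrt(var_x * var_y)

On a small case this should agree with np.corrcoef(pdist(X), cophenet(Z)) up to floating-point accumulation error, though the running-sum formula is numerically less stable than np.corrcoef. I have no idea whether this is a sensible approach at scale, hence the question.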