
Let's say I have the hierarchical clustering shown in the diagram below. To get the cluster labels, I need to define a proper threshold distance. For example, if I put the threshold at 0.32, I would probably get 3 clusters, and if I set it around 0.35, I would get 2 clusters from the diagram below.

Instead of using a threshold at some fixed distance, I would like to get the cluster labels based on the merging order.

I would like to define the clusters based on the merge steps: first merge, second merge, etc.

For example, here I would like to get the cluster labels after at least the first merge has happened, which would give 3 clusters:

cluster1: p1
cluster2: p3 and p4
cluster3: p2 and p5.

If instead I ask for the clustering after at least the second merge has happened, I would have 2 clusters:

cluster1: p1
cluster2: p3, p4, p2 and p5.

Does scipy have a built-in method to extract this kind of information? If not, is there any way I can extract it from the hierarchical clustering? Any suggestions would be great.

[dendrogram showing the hierarchical clustering of p1–p5]

Example cases:

The idea is that I don't want to hardcode any threshold to define the number of clusters, but rather find the clusters based on their merging order. For example, suppose there are p1, p2 and p3, and in one condition p1 and p2 fall in the same cluster at 0.32. In another case, more data is added for p1, p2 and p3; they may still fall in the same cluster, but the merge distance of their clusters may have changed. In that case p1 and p2 are still in the same cluster, so the distance threshold used to define clusters is irrelevant.

user96564

1 Answer


The linkage matrix produced by the scipy.cluster.hierarchy functions has an extra field for the number of observations in the newly formed cluster:

scipy.cluster.hierarchy.linkage: A (n−1) by 4 matrix Z is returned. At the i-th iteration, clusters with indices Z[i, 0] and Z[i, 1] are combined to form cluster n+i. A cluster with an index less than n corresponds to one of the n original observations. The distance between clusters Z[i, 0] and Z[i, 1] is given by Z[i, 2]. The fourth value Z[i, 3] represents the number of original observations in the newly formed cluster.
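For instance, here is a minimal sketch (three made-up 1-D points, not your data) that shows what those four columns look like in practice:

import numpy as np
from scipy.cluster.hierarchy import single
from scipy.spatial.distance import pdist

# three 1-D observations with ids 0, 1 and 2 (made-up data)
pts = np.array([[0.0], [1.0], [3.0]])
print(single(pdist(pts)))
# [[0. 1. 1. 2.]    row 0: merge obs 0 and 1 at distance 1.0 -> new cluster 3, size 2
#  [2. 3. 2. 3.]]   row 1: merge obs 2 with cluster 3 at distance 2.0, size 3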

I'm not sure I entirely follow your example[1], but you could use the cluster size to define the depth of the cut that produces your flat list of clusters, and get something along those lines. The logic could be, for example, "stop at the last merge where the new cluster size is still 2 or less" (giving your first list of 3 clusters) or "stop at the first merge where the cluster size is 3 or more" (giving your second list of 2 clusters).

Here's an example with a dataset that gives a hierarchical clustering similar to the one shown in your plot, with results that match your two examples:

import numpy as np
from scipy.cluster.hierarchy import single, fcluster
from scipy.spatial.distance import pdist

X = [
    (0, 0, .45), # P1
    (0, .36, 0), # P2
    (0, 0, 0), # P3
    (.3, 0, 0), # P4
    (.31, .36, 0), # P5
]

Z = single(pdist(X))

i1 = np.argwhere(Z[:,3] <= 2)[-1,0]        # => i1 = 1
d1 = Z[i1, 2]                              # => d1 = 0.31
c1 = fcluster(Z, d1, criterion='distance') # => c1 = [3, 2, 1, 1, 2]
# i.e., three clusters: {P3, P4}, {P2, P5} and {P1}

i2 = np.argwhere(Z[:,3] >= 3)[0,0]         # => i2 = 2
d2 = Z[i2, 2]                              # => d2 = 0.36
c2 = fcluster(Z, d2, criterion='distance') # => c2 = [2, 1, 1, 1, 1]
# i.e., two clusters: {P2, P3, P4, P5} and {P1}

[1] Wouldn't "at least first merge" be immediately when P3 and P4 are combined, leaving you with 4 clusters? And there's no reason to expect a "second merge" to always combine two pairs: it can also merge a single observation with a pair. That's why I'm suggesting using cluster size rather than "N mergings".
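As a side note: if you do want the literal partition after the k-th merge, note that each merge reduces the number of clusters by one, so after k merges there are n − k clusters left, and fcluster's 'maxclust' criterion can recover that partition. A minimal sketch, reusing X and Z from above and assuming no tied merge distances:

n = len(X)  # 5 observations
k = 2       # partition after the second merge
ck = fcluster(Z, n - k, criterion='maxclust')
# => three clusters: {P3, P4}, {P2, P5} and {P1}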

fizzie
  • thanks for the answer, I have to check this solution more carefully. The idea is the same as in the "Example cases" I added to the question: I don't want to hardcode any threshold, but rather find the clusters based on their merging order. – user96564 Sep 05 '21 at 20:05
  • I don't follow your example here. What do you mean when you say that "more data is added for p1, p2 and p3"? Do you mean more observations or more variables? In either case, why do you want p1 & p2 to remain in the same cluster after obtaining more data? It will always be possible to force any set of points to be in the same cluster by selecting a distance large enough, but this is not necessarily useful. – Ryan Sep 06 '21 at 12:41
  • I don't think "based on their merging order" is unambiguous. But the example in this answer does *not* hardcode a threshold for distance or number of clusters. Instead, it's based on stopping when (in the process of merging) the newly created cluster first reaches a certain size (in terms of the number of observations in it). By understanding the structure of the linkage matrix (which lists the sequence of merging steps in order) you should be able to implement any other flattening condition that you can clearly define (see the sketch after these comments). – fizzie Sep 06 '21 at 19:57
  • @Ryan if new data comes and a new clustering is computed, I don't want to check the dendrogram every time to see whether they are still in the same clusters and then decide the distance threshold. The distance threshold is irrelevant; finding that they are still part of the same clusters is the main goal, without checking the dendrogram every time. – user96564 Sep 07 '21 at 11:35
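For completeness, here is a minimal sketch of the "replay the merges" approach mentioned in the comments above: it walks the linkage matrix row by row, keeps track of cluster membership, and stops as soon as a caller-supplied condition on the newly formed cluster holds. The helper name flatten_when and the stopping rule are made up for illustration; X is the dataset from the answer.

import numpy as np
from scipy.cluster.hierarchy import single
from scipy.spatial.distance import pdist

def flatten_when(Z, n, stop):
    # Replay the merges in Z in order; members maps each active
    # cluster id to the list of original observation indices in it.
    members = {i: [i] for i in range(n)}
    for i, (a, b, _dist, size) in enumerate(Z):
        members[n + i] = members.pop(int(a)) + members.pop(int(b))
        if stop(int(size), i):  # condition on the newly formed cluster
            break
    labels = np.empty(n, dtype=int)
    for label, obs in enumerate(members.values(), start=1):
        labels[obs] = label
    return labels

X = [(0, 0, .45), (0, .36, 0), (0, 0, 0), (.3, 0, 0), (.31, .36, 0)]
Z = single(pdist(X))

# stop once the newly formed cluster holds 3 or more observations
print(flatten_when(Z, len(X), lambda size, i: size >= 3))
# => [1 2 2 2 2], i.e. {P1} and {P2, P3, P4, P5}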