Comparing HDBSCAN labels with soft cluster results

Question

I'm getting the soft clusters from a dataset using HDBSCAN as follows:

import hdbscan
import numpy as np

clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True)
clusterer.fit(data)
soft_clusters = hdbscan.all_points_membership_vectors(clusterer)
# Index of the highest-membership cluster for each point
closest_clusters = np.argmax(soft_clusters, axis=1)

soft_clusters is a 2D array holding, for each data point, the probability that it belongs to each cluster, so closest_clusters should contain the cluster each point is most likely to belong to. However, when I compare closest_clusters with clusterer.labels_ (the label that HDBSCAN assigns to each data point), almost none of the labeled points match; e.g. a data point with label 3 has 4 as its closest cluster.

I'm not sure if I'm misunderstanding how soft clustering works or if something is wrong with the code. Any help is appreciated!

I don't remember anything like this from the HDBSCAN* paper... what is the theoretical support of this? — Has QUIT--Anony-Mousse, Jul 06 '17 at 06:08
I think it's a relatively new feature. An example is here: http://hdbscan.readthedocs.io/en/latest/soft_clustering.html and an explanation for how it works is here: http://hdbscan.readthedocs.io/en/latest/soft_clustering_explanation.html — Andrew Ng, Jul 06 '17 at 22:13
I'm referring to published, theoretical support for the approach. It's easy to hack up some "fuzzy" variation of HDBSCAN*, but that doesn't mean it is statistically sound. — Has QUIT--Anony-Mousse, Jul 07 '17 at 06:38
This is probably due to this bug: https://github.com/scikit-learn-contrib/hdbscan/issues/123 — gmjonker, Apr 30 '18 at 14:13
Yes, I raised the issue after asking this question—I did a re-mapping based on my findings and it worked for me but based on some of the other comments the behavior doesn't seem to be consistent. — Andrew Ng, May 01 '18 at 18:19
Just ran into this issue. Doesn't seem to have been fixed yet... — Isopycnal Oscillation, May 09 '18 at 23:49

score 3 · Answer 1 · answered Jun 08 '18 at 22:01

The author of HDBSCAN has attempted to fix this problem but it seems that, as it stands, it is simply how it works and there is no way to fix it without some major restructuring. Here is his comment:

Digging into this I think the answer (unfortunately?) is that this is "just how it works". The soft clustering considers the distance from exemplars, and the merge height in the tree between the point and each of the clusters. These points that end up "wrong" are points that sit on a split in the tree -- they have the same merge height to their own cluster (perhaps that is a bug, I'll look into it further). That means tree-wise we don't distinguish them, and in terms of pure ambient distance to exemplars they are closer to the "wrong" cluster, and so get misclassified. This is a little weird, but the soft clustering is ultimately a little different than the hard clustering, so corner cases like this can theoretically occur.
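
To see how widespread the disagreement is in your own data, it may help to cross-tabulate the hard labels against the argmax of the soft memberships. The sketch below is only a diagnostic, not part of hdbscan: compare_hard_and_soft is a hypothetical helper name, the arrays are made-up stand-ins for clusterer.labels_ and the output of all_points_membership_vectors, and it assumes that column j of the membership matrix is meant to correspond to label j (which is exactly the assumption in question here).

```python
import numpy as np


def compare_hard_and_soft(labels, soft_clusters):
    """Cross-tabulate hard labels against the argmax of soft memberships.

    Returns (table, agreement). table[i, j] counts non-noise points with
    hard label i whose largest membership is in column j; agreement is the
    fraction of non-noise points where the two coincide.
    """
    labels = np.asarray(labels)
    soft_clusters = np.asarray(soft_clusters, dtype=float)
    closest = np.argmax(soft_clusters, axis=1)

    clustered = labels >= 0  # HDBSCAN marks noise points with label -1
    n_clusters = soft_clusters.shape[1]
    table = np.zeros((n_clusters, n_clusters), dtype=int)
    # Unbuffered add so that repeated (label, closest) pairs are all counted
    np.add.at(table, (labels[clustered], closest[clustered]), 1)

    agreement = np.trace(table) / max(int(clustered.sum()), 1)
    return table, agreement


# Made-up stand-ins (assumption): replace with clusterer.labels_ and
# hdbscan.all_points_membership_vectors(clusterer) on real data.
labels = np.array([0, 0, 1, 1, -1, 1])
soft = np.array([
    [0.8, 0.1],
    [0.6, 0.3],
    [0.2, 0.7],
    [0.5, 0.4],  # hard label 1, but largest membership is in column 0
    [0.1, 0.1],  # noise point, excluded from the comparison
    [0.1, 0.9],
])

table, agreement = compare_hard_and_soft(labels, soft)
print(table)
print(agreement)
```

If most of the mass sits on the diagonal with a few stray off-diagonal counts, that would be consistent with the corner cases described in the quote above. If instead each row has one dominant off-diagonal entry, the mismatch looks more like a consistent relabeling, which is the situation where a re-mapping like the one mentioned in the comments could help.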
