I'm getting the soft clusters from a dataset using HDBSCAN as follows:
import hdbscan
import numpy as np

# prediction_data=True is needed for the soft-clustering functions
clusterer = hdbscan.HDBSCAN(min_cluster_size=10, prediction_data=True)
clusterer.fit(data)
soft_clusters = hdbscan.all_points_membership_vectors(clusterer)
closest_clusters = [np.argmax(x) for x in soft_clusters]
soft_clusters is a 2D array of the probabilities that each data point belongs to each cluster, so closest_clusters should hold, for each data point, the label of the cluster it is most likely to belong to. However, when I compare closest_clusters with clusterer.labels_ (the label that HDBSCAN assigns to each data point), I find that almost none of them match for the data points that have a label; e.g. a data point with label 3 has 4 as its closest cluster.
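To put a number on the mismatch, a check along the following lines could be used. This is only a minimal sketch: it assumes noise points carry the label -1 and that column i of the membership array is meant to correspond to label i, which is exactly the assumption in question here. The helper name label_agreement and the toy arrays are made up for illustration and are not part of hdbscan.

```python
import numpy as np

def label_agreement(labels, membership):
    """Compare hard labels with the argmax of soft membership vectors.

    Returns the fraction of labelled (non-noise) points whose most likely
    cluster equals their hard label, plus the indices that disagree.
    Assumes noise is labelled -1 and column i corresponds to label i.
    """
    labels = np.asarray(labels)
    membership = np.asarray(membership)
    closest = membership.argmax(axis=1)      # most likely cluster per point
    labeled = labels != -1                   # ignore noise points
    if not labeled.any():
        return float("nan"), np.array([], dtype=int)
    matches = closest[labeled] == labels[labeled]
    mismatched = np.flatnonzero(labeled)[~matches]
    return matches.mean(), mismatched

# Hypothetical toy data, not real HDBSCAN output
toy_labels = np.array([0, 0, 1, -1, 1])
toy_membership = np.array([
    [0.9, 0.1],
    [0.6, 0.3],
    [0.2, 0.7],
    [0.1, 0.1],
    [0.8, 0.1],
])
rate, bad = label_agreement(toy_labels, toy_membership)
print(rate, bad)
```

With the real data this would be called as label_agreement(clusterer.labels_, soft_clusters); the returned indices point at the labelled points whose closest cluster differs from their assigned label.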
I'm not sure if I'm misunderstanding how soft clustering works or if something is wrong with the code. Any help is appreciated!