
Is it possible to select the number of clusters in the HDBSCAN algorithm in Python? Or is the only way to play around with input parameters such as alpha and min_cluster_size?

Thanks

UPDATE: here is the code to use fcluster and hdbscan together

import hdbscan
from scipy.cluster.hierarchy import fcluster

clusterer = hdbscan.HDBSCAN()
clusterer.fit(X)  # X is the data matrix

# The single-linkage tree in scipy's linkage-matrix format
Z = clusterer.single_linkage_tree_.to_numpy()
# Cut the tree so that at most 2 clusters are produced
labels = fcluster(Z, 2, criterion='maxclust')
aL_eX
user1571823

2 Answers


Thankfully, in June 2020 a contributor on GitHub provided a commit ("Module for flat clustering") that adds code to hdbscan allowing us to choose the number of resulting clusters.

To do so:

from hdbscan import flat

# Cluster the training data into exactly n_clusters flat clusters
clusterer = flat.HDBSCAN_flat(train_df, n_clusters, prediction_data=True)
# Assign new points to one of those clusters
flat.approximate_predict_flat(clusterer, points_to_predict, n_clusters)

You can find the code in flat.py. You should be able to choose the number of clusters for test points using approximate_predict_flat.

In addition, a Jupyter notebook has also been written explaining how to use it; see here.

Mar

If you explicitly need to get a fixed number of clusters then the closest thing to managing that would be to use the cluster hierarchy and perform a flat cut through the hierarchy at the level that gives you the desired number of clusters. That does involve working with one of the tree objects that HDBSCAN exposes and getting your hands a little dirty, but it can be done.
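A minimal sketch of such a flat cut on toy data, using scipy's hierarchy tools: the matrix returned by `clusterer.single_linkage_tree_.to_numpy()` has the same format as scipy's `linkage()` output, so the same cut applies to it (here a scipy linkage matrix stands in so the snippet runs without hdbscan).

```python
# Sketch: a flat cut through a single-linkage hierarchy for a fixed
# number of clusters. A scipy linkage matrix stands in for the matrix
# from clusterer.single_linkage_tree_.to_numpy(), which shares its format.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
# Two well-separated blobs of toy 2-D data.
X = np.vstack([rng.normal(0, 0.3, (20, 2)),
               rng.normal(5, 0.3, (20, 2))])

Z = linkage(X, method='single')
# Cut the hierarchy at the level that yields exactly 2 clusters.
labels = fcluster(Z, t=2, criterion='maxclust')
print(len(set(labels)))  # -> 2
```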

Leland McInnes
  • Thanks for your comment. Looking into your suggestion, I found that HDBSCAN can be combined with scipy: I can pass the single_linkage_tree_ from HDBSCAN to fcluster in scipy with the criterion 'maxclust' to obtain two clusters. However, in some cases HDBSCAN does not find two clusters even if the data structure visually suggests them. I have tried to tune min_samples and min_cluster_size but I don't get the desired result. – user1571823 Jan 23 '18 at 20:50
  • @user1571823 can you add an answer regarding your approach? I tried supplying the single_linkage tree to fcluster, but I always get results where the first cluster contains almost all samples and the rest have exactly one sample each. – Jouni Helske Feb 27 '18 at 13:48
  • @JouniHelske With HDBSCAN you can do `clusterer.single_linkage_tree_.get_clusters(epsilon_value, min_cluster_size=m)` to get clusters at a cut level of `epsilon_value` and exclude any clusters with less than `m` points. – Leland McInnes Feb 28 '18 at 00:13
  • @LelandMcInnes But if I want fixed k number of clusters I should go through the different values of epsilon and see when I end up with k clusters? – Jouni Helske Feb 28 '18 at 07:39
  • @JouniHelske you are referring to the same problem I reported in my first comment. It seems to be related to the nature of HDBSCAN. In my case, the dataset has one extreme vector which was causing the non-intuitive clustering structure, i.e. a single vector forming its own cluster even though visually there are clearly two dense clusters. Because of this, I moved to fastcluster and its ward linkage function. If you also have extreme points in your dataset, you could try to exclude those vectors before running HDBSCAN and assign them to the closest centroid afterwards. – user1571823 Mar 01 '18 at 08:47
  • I believe you can use the tools from `scipy.cluster.hierarchy` to extract a flat clustering for a fixed number of clusters. The format of the result of `clusterer.single_linkage_tree_.to_numpy()` can be fed directly to scipy's hierarchical clustering tools. – Leland McInnes Mar 06 '18 at 23:22
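The two scipy-based routes discussed in these comments can be sketched side by side: `criterion='distance'` cuts the tree at a fixed epsilon (analogous to `get_clusters(epsilon_value, ...)`), while `criterion='maxclust'` targets a cluster count directly, so no manual epsilon search is needed. A scipy linkage matrix stands in for the matrix from `clusterer.single_linkage_tree_.to_numpy()`, which shares the same format.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(1)
# Three well-separated blobs of toy 2-D data.
X = np.vstack([rng.normal(0, 0.2, (15, 2)),
               rng.normal(3, 0.2, (15, 2)),
               rng.normal(6, 0.2, (15, 2))])
Z = linkage(X, method='single')  # stand-in for single_linkage_tree_.to_numpy()

# Cut at a fixed distance epsilon (analogous to get_clusters(epsilon_value, ...)).
by_distance = fcluster(Z, t=1.0, criterion='distance')
# Cut to a fixed number of clusters directly -- no epsilon search required.
by_count = fcluster(Z, t=3, criterion='maxclust')
print(len(set(by_distance)), len(set(by_count)))  # -> 3 3
```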