I am using hdbscan to find clusters within a dataset in a Python Jupyter notebook.

import pandas as pd
import numpy as np
data = pd.read_csv('data.csv')

That data looks something like this:

data

import hdbscan
clusterSize = 6
clusterer = hdbscan.HDBSCAN(min_cluster_size=clusterSize).fit(data)

And yay! everything seems to work!

I then want to see some results, so I add them to my data frame:

data.insert(18,"labels",clusterer.labels_)
data.insert(19,"probabilities",clusterer.probabilities_)

But wait, I have rows with labels for clusters that have probabilities at 0. How does that make sense? Shouldn't any object in a cluster have a probability value > 0? Oh, and all the probabilities are only 0 OR 1.
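To see how many rows are affected, a small helper can list the points that carry a cluster label but have zero membership probability. This is only a sketch: the function name `zero_probability_members` and the example values are hypothetical stand-ins for `clusterer.labels_` and `clusterer.probabilities_`.

```python
def zero_probability_members(labels, probabilities):
    """Return indices of points with a cluster label (not -1) but probability 0."""
    return [
        i
        for i, (label, prob) in enumerate(zip(labels, probabilities))
        if label != -1 and prob == 0
    ]


# Hypothetical values standing in for clusterer.labels_ and clusterer.probabilities_.
example_labels = [0, 0, -1, 1, 1, -1]
example_probabilities = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0]

suspicious = zero_probability_members(example_labels, example_probabilities)
print(suspicious)  # [1, 4]
```

With a fitted model, the same call would be `zero_probability_members(clusterer.labels_, clusterer.probabilities_)`.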

So I rerun this in Jupyter notebook, specifically, I just rerun

clusterer = hdbscan.HDBSCAN(min_cluster_size=clusterSize).fit(data)

and I check the values of clusterer.labels_ and clusterer.probabilities_, and they are different. Isn't this thing supposed to be consistent? Why would those values change? Is there some hidden state that I'm not told about? But now my clusterer.probabilities_ have values that are between 0 and 1... so that's good, right?
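One thing worth ruling out here: if the two `data.insert` calls ran before the re-fit, `data` now contains the `labels` and `probabilities` columns, so the second `fit(data)` clusters on two extra features and is not the same experiment as the first. A minimal sketch of filtering those columns out before fitting follows; the helper name `feature_columns` and the example column names are hypothetical, since the real columns of data.csv are not shown.

```python
RESULT_COLUMNS = ("labels", "probabilities")


def feature_columns(columns, result_columns=RESULT_COLUMNS):
    """Return the column names to cluster on, skipping inserted result columns."""
    return [name for name in columns if name not in result_columns]


# Hypothetical column names; the real data.csv columns are not shown.
columns_after_insert = ["a", "b", "c", "labels", "probabilities"]
print(feature_columns(columns_after_insert))  # ['a', 'b', 'c']

# With the DataFrame from the question this would be used as (not run here):
# features = data[feature_columns(data.columns)]
# clusterer = hdbscan.HDBSCAN(min_cluster_size=clusterSize).fit(features)
```

Keeping the fitted results in a separate frame, or fitting on a copy of the feature columns only, avoids the problem altogether.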

I'm obviously not very familiar with this hdbscan tool, but can someone explain why it gives different answers when run multiple times, and whether a probability of 0 on a labeled/clustered object makes sense?

Glen Pierce

1 Answer

According to the hdbscan API documentation:

  • labels_: Cluster labels for each point in the dataset given to fit(). Noisy samples are given the label -1.
  • probabilities_: The strength with which each sample is a member of its assigned cluster. Noise points have probability zero; points in clusters have values assigned proportional to the degree that they persist as part of the cluster.

Therefore a probability of zero is meaningful. I also expected different runs on the same data to give the same results, but that does not appear to be exactly true. According to Wikipedia:

  • DBSCAN is not entirely deterministic: border points that are reachable from more than one cluster can be part of either cluster, depending on the order the data are processed. For most data sets and domains, this situation does not arise often and has little impact on the clustering result:[4] both on core points and noise points, DBSCAN is deterministic. DBSCAN* is a variation that treats border points as noise, and this way achieves a fully deterministic result as well as a more consistent statistical interpretation of density-connected components.

So choosing a specific algorithm may help make the clustering deterministic.
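When comparing two runs, it can help to separate a mere renumbering of clusters from a real change in which points are grouped together. The sketch below checks whether two label arrays describe the same grouping; the name `same_partition` is hypothetical, and whether hdbscan actually renumbers clusters between runs is not established here, so treat this purely as a diagnostic.

```python
def same_partition(labels_a, labels_b):
    """True if both labelings group points identically, ignoring cluster numbering.

    Noise points (label -1) must be noise in both labelings.
    """
    if len(labels_a) != len(labels_b):
        return False
    a_to_b, b_to_a = {}, {}
    for a, b in zip(labels_a, labels_b):
        # A point that is noise in one run must be noise in the other.
        if (a == -1) != (b == -1):
            return False
        # Each label in one run must map to exactly one label in the other.
        if a_to_b.setdefault(a, b) != b or b_to_a.setdefault(b, a) != a:
            return False
    return True


# Hypothetical label arrays from two runs.
print(same_partition([0, 0, 1, -1], [1, 1, 0, -1]))  # True: only the numbering differs
print(same_partition([0, 0, 1, 1], [0, 1, 1, 1]))    # False: the grouping differs
```

If this returns False for two fits on identical input, the runs genuinely disagree and the inputs and parameters are worth comparing next.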

Shahriar49
  • In my case, I'm getting probabilities of 0 on labels that are not -1. – Glen Pierce Nov 22 '20 at 22:16
  • Maybe it is because your data elements are also only 0/1. Imagine a square: you have four points, (0,0), (0,1), (1,0), (1,1). Clustering doesn't really make sense for a square, and you can do it in different ways, all equally valid. The same goes for the 3-D equivalent (a cube). Your data elements in each dimension are only 0/1, so the data may simply not be a good fit for clustering. Have you tried other clustering methods? – Shahriar49 Nov 23 '20 at 02:37
  • Or, in other words, if you have categorical data (which it seems you have), algorithms based on Euclidean distance don't make much sense. https://towardsdatascience.com/when-clustering-doesnt-make-sense-c6ed9a89e9e6 – Shahriar49 Nov 23 '20 at 02:57
  • K-modes is recommended for categorical data: https://pypi.org/project/kmodes/ – Shahriar49 Nov 23 '20 at 03:04
  • These are all very good points, thank you @Shahriar49 – Glen Pierce Nov 24 '20 at 17:34
  • I should also mention that I have labels that are not -1 with probability 0. Could those be these border cases? – Glen Pierce Nov 24 '20 at 17:35
  • I don't think it theoretically makes sense because if a point is assigned to a cluster (i.e. not a noise point), it should have a positive probability. But maybe it is too small and will be quantized to zero by limited computer precision. Is it happening often? – Shahriar49 Nov 25 '20 at 16:15
  • Do those zero-probability non-noise points belong to a cluster or have a unique cluster label for themselves? – Shahriar49 Nov 25 '20 at 18:01