I am using hdbscan to find clusters within a dataset in a Python Jupyter notebook.
import pandas as pd
import numpy as np
data = pd.read_csv('data.csv')
That data looks something like this:
import hdbscan
clusterSize = 6
clusterer = hdbscan.HDBSCAN(min_cluster_size=clusterSize).fit(data)
And yay! everything seems to work!
I then want to look at the results, so I add them to my data frame:
# append the cluster assignments and membership strengths as new columns
data.insert(18, "labels", clusterer.labels_)
data.insert(19, "probabilities", clusterer.probabilities_)
But wait, some rows have a cluster label and yet a probability of 0. How does that make sense? Shouldn't any object in a cluster have a probability value > 0? Oh, and every probability is exactly 0 or 1, with nothing in between.
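To make that observation concrete, this is roughly how I counted the odd rows. The small frame below is made-up stand-in data, not my real dataset; only the column names labels and probabilities match what I inserted above. In hdbscan a label of -1 marks noise, so I filter for rows that have a real cluster label but a probability of 0.

```python
import pandas as pd

# Made-up stand-in for my real frame; only the two column names are real.
demo = pd.DataFrame({
    "labels":        [0, 0, 1, -1, 1, 0],
    "probabilities": [1.0, 0.0, 1.0, 0.0, 0.0, 1.0],
})

# Rows assigned to a cluster (label != -1 means "not noise") with probability 0.
suspicious = demo[(demo["labels"] != -1) & (demo["probabilities"] == 0)]
print(len(suspicious), "clustered rows with probability 0")

# Distinct probability values; in my first run this was just 0 and 1.
distinct_probs = sorted(demo["probabilities"].unique())
print(distinct_probs)
```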
So I rerun this in the Jupyter notebook. Specifically, I rerun just this one line:
clusterer = hdbscan.HDBSCAN(min_cluster_size=clusterSize).fit(data)
Then I check the values of clusterer.labels_ and clusterer.probabilities_, and they are different from the first run. Isn't this thing supposed to be consistent? Why would those values change? Is there some hidden state that I'm not told about? On the other hand, clusterer.probabilities_ now has values between 0 and 1... so that's good, right?
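One check I can think of, to rule out that the input itself changed between the two runs: as far as I know, DataFrame.insert modifies the frame in place, so after adding labels and probabilities the second fit(data) would be handed two more columns than the first. The sketch below uses a tiny made-up frame and no hdbscan at all; the names demo, f1, f2, feature_cols, fake_labels and fake_probs are all hypothetical. It only shows how to compare the frame before and after the inserts, and how to select the original columns again.

```python
import numpy as np
import pandas as pd

# Tiny made-up frame standing in for data.csv (f1, f2 are hypothetical names).
demo = pd.DataFrame({"f1": [0.1, 0.2, 5.0, 5.1], "f2": [1.0, 1.1, 9.0, 9.2]})

# Remember which columns were the actual features before adding results.
feature_cols = list(demo.columns)
shape_before = demo.shape

# Stand-ins for clusterer.labels_ and clusterer.probabilities_.
fake_labels = np.array([0, 0, 1, 1])
fake_probs = np.array([1.0, 0.8, 1.0, 0.9])

# insert() mutates the frame in place and returns None.
demo.insert(len(demo.columns), "labels", fake_labels)
demo.insert(len(demo.columns), "probabilities", fake_probs)
shape_after = demo.shape

print(shape_before, "->", shape_after)

# Selecting only the remembered feature columns recovers the original input.
features_only = demo[feature_cols]
print(list(features_only.columns))
```

If the shapes differ like this in my real notebook, then the two fits were not run on the same input, and refitting on data[feature_cols] should show whether the results are actually reproducible.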
I'm obviously not very familiar with hdbscan, but can someone explain why it gives different answers when run multiple times, and whether a probability of 0 on a labeled/clustered object makes sense?