
I'm working on a comparison of clustering algorithms, and I want to know how HDBSCAN in R calculates the so-called membership 'probability'.

  • Hello Skoubani. Welcome to Stack Overflow. Which version of `hdbscan()` are you using? Would you please post the code you used to run `hdbscan()`, including the packages you loaded? Also consider reading [how to create a minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example), and if you will be posting regularly to the `r` tag, read the [r tag info](https://stackoverflow.com/tags/r/info) page. – Len Greski Mar 29 '21 at 21:51

1 Answer


In the dbscan package, the `hdbscan()` function does some validity checking of the object passed as input, and then uses the `dbscan::kNNdist()` function to calculate the distance from each point to its k-th nearest neighbor. The value of k is the `minPts` argument passed to `hdbscan()`, less 1.

 core_dist <- kNNdist(x, k = minPts - 1)
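As a rough illustration of what that call computes, here is a base-R sketch of the same kind of core distance. It assumes Euclidean distance and a small made-up data set; `x`, `d` and the toy values are hypothetical, and this is not the package's own implementation.

```r
# Toy data: five points on a line (hypothetical values for illustration).
x <- matrix(c(0, 1, 2, 4, 8), ncol = 1)
minPts <- 3

# Full pairwise Euclidean distance matrix.
d <- as.matrix(dist(x))

# In each sorted row, position 1 is the point itself (distance 0),
# so the (minPts - 1)-th nearest neighbor sits at position minPts.
core_dist <- apply(d, 1, function(row) sort(row)[minPts])

print(unname(core_dist))
```

With the dbscan package installed, `kNNdist(x, k = minPts - 1)` should return comparable values for the same data.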

`hdbscan()` then uses the core distance as its measure of density and calculates membership probabilities with the following code (from the hdbscan.R source):

  ## Generate membership 'probabilities' using core distance as the measure of density
  prob <- rep(0, length(cl))
  for (cid in sl){
    ccl <- res[[as.character(cid)]]
    max_f <- max(core_dist[which(cl == cid)])
    pr <- (max_f - core_dist[which(cl == cid)])/max_f
    prob[cl == cid] <- pr
  }

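For each cluster id in the salient clusters object `sl`, the code finds the maximum core distance among the cluster's members, subtracts each member's core distance from that maximum, and divides the result by the maximum to convert it to a proportion. The member with the largest core distance (the sparsest point in the cluster) therefore gets 0, and the value approaches 1 as a point's core distance shrinks. Points that belong to no cluster in `sl` keep their initial value of 0.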

These membership probabilities are then stored in the list returned by `hdbscan()` as the `membership_prob` element.
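To make the normalization concrete, here is a small worked sketch of the same formula on hand-made inputs. The core distances, the cluster labels, and the use of 0 as a noise label are assumptions chosen for illustration; they are not output from `hdbscan()`.

```r
# Hypothetical core distances and cluster labels for six points.
core_dist <- c(0.2, 0.5, 1.0, 0.3, 0.6, 2.0)
cl <- c(1, 1, 1, 2, 2, 0)  # 0 is assumed to mark a noise point
sl <- c(1, 2)              # ids of the selected clusters

# Same normalization as in the quoted source: (max - d) / max per cluster.
prob <- rep(0, length(cl))
for (cid in sl) {
  in_cluster <- cl == cid
  max_f <- max(core_dist[in_cluster])
  prob[in_cluster] <- (max_f - core_dist[in_cluster]) / max_f
}

print(round(prob, 2))
```

In cluster 1 the densest point scores (1.0 - 0.2) / 1.0 = 0.8 and the sparsest scores 0. The noise point is never visited by the loop, so it stays at 0. Because each cluster is scaled by its own maximum, the values are relative densities within a cluster rather than probabilities that can be compared across clusters.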

Len Greski