Questions tagged [hdbscan]

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed (points with many nearby neighbors) and marks as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away). DBSCAN is one of the most commonly used clustering algorithms and among the most cited in the scientific literature.

In 2014, the algorithm received the Test of Time Award (an award given to algorithms which have received substantial attention in theory and practice) at the leading data mining conference, KDD.
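The density idea in the description above can be sketched in a few lines of plain Python. This is only a minimal illustration, not the reference implementation or any library's code: the function name `dbscan`, the parameter names `eps` (neighborhood radius) and `min_samples` (neighbors needed for a point to count as dense, itself included), the use of -1 as the noise label, and the toy points are all assumptions chosen for the example. It also uses a brute-force neighbor search, which is fine for a handful of points but not for real data.

```python
from math import dist  # Euclidean distance, Python 3.8+

NOISE = -1  # assumed label for outliers


def dbscan(points, eps, min_samples):
    """Return one label per point: 0, 1, 2, ... for clusters, -1 for noise."""
    n = len(points)
    # Neighborhood of each point: indices within eps (the point itself included).
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    labels = [None] * n  # None means "not visited yet"
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        if len(neighbors[i]) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1  # i is a dense ("core") point: start a new cluster
        labels[i] = cluster
        frontier = list(neighbors[i])
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue  # already assigned
            labels[j] = cluster
            if len(neighbors[j]) >= min_samples:
                frontier.extend(neighbors[j])  # j is also core: keep expanding
    return labels


# Two tight groups of four points each, plus one isolated point.
points = [
    (0.0, 0.0), (0.0, 0.5), (0.5, 0.0), (0.5, 0.5),
    (5.0, 5.0), (5.0, 5.5), (5.5, 5.0), (5.5, 5.5),
    (10.0, 0.0),
]
labels = dbscan(points, eps=1.0, min_samples=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

With these toy points, the two tight groups become clusters 0 and 1, and the isolated point is marked as noise because nothing else lies within `eps` of it.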

81 questions
1
vote
0 answers

Measuring "single strongest peak" in a distribution

I'd like to automatically detect whether data have a very strongly discernable peak, with any particular distribution. The data can otherwise be quite noisy, or there might be several 'false' peaks. Here are a few examples of the performance I'd…
1
vote
1 answer

HDBSCAN Cluster choice

I have been working with HDBSCAN and have a few hundred clusters based on my data. I am trying to select some cluster groups for further analysis. I am looking for the clusters which have a high inter-cluster distance, as in more spread out and behave…
Jazz
  • 445
  • 2
  • 7
  • 22
1
vote
1 answer

Python HDBScan class always fails on second iteration before even entering first function

I am attempting to look at conglomerated outlier information, utilizing several different SKLearn, HDBScan, and custom outlier detection classes. However, for some reason I am consistently running into an error where any class utilizing HDBScan…
WolVes
  • 1,286
  • 2
  • 19
  • 39
1
vote
1 answer

Anomalies Detection by DBSCAN

I am using DBSCAN on my training dataset in order to find outliers and remove those outliers from the dataset before training the model. I am using DBSCAN on my training set of 7697 rows with 8 columns. Here is my code: from sklearn.cluster import DBSCAN X =…
1
vote
1 answer

How to extract clusters from HDBSCAN algorithm

I'd like to extract the original points that form each cluster. I know that HDBSCAN doesn't have cluster centers, so I thought that if each label corresponds to the original point in the same order, I could do the following, but the results are really bad…
user11936452
1
vote
2 answers

Cluster a list of geographic points by distance and constraints

I have a delivery app, and I want to group orders (each order has lat and lng coordinates) by location proximity (linear distance) and constraints like max orders and max total products (each order has a number of products) inside a group. For…
Alex
  • 1,033
  • 4
  • 23
  • 43
1
vote
0 answers

How to find top terms in dbscan or hdbscan clusters?

I'm using dbscan from sklearn and HDBSCAN to cluster some documents. vectorizer = TfidfVectorizer(stop_words=mystopwords) X = vectorizer.fit_transform(y) dbscan = DBSCAN(eps=0.75, min_samples = 9) clusters = dbscan.fit_predict(X) Now how can I get…
1
vote
2 answers

dealing with noise in hdbscan

I have been testing hdbscan from the scikit learn package with a small instance of (x,y) points "point_coord" and the resulting clusters do not really make sense to me. Given the small size of the sample, I am allowing a single cluster. I would…
Mike
  • 375
  • 1
  • 4
  • 14
1
vote
0 answers

Printing a Python-generated plot in R

I am working on performing a HDBSCAN, and am performing the analysis using the hdbscan python module within R. I have the following code: library(reticulate) hdb <- import("hdbscan") # Import hdbscan Python library # Create dummy data. My actual…
kneijenhuijs
  • 1,189
  • 1
  • 12
  • 21
1
vote
1 answer

How to know to which matrix row corresponds each cluster label?

After doing clustering I end up with an object which stores all the cluster labels, something like this: clusterer.labels_ The above is typically a list or an array. Then I always assign the labels to the original pandas dataframe (dataset) like…
tumbleweed
  • 4,624
  • 12
  • 50
  • 81
0
votes
0 answers

Error with UMAP: "ufunc 'correct_alternative_cosine' did not contain a loop with signature matching types numpy.dtype[float32]"

I'm trying to use UMAP for dimensionality reduction on some embeddings. However, I encounter the following error when my dataset has more than 5k rows: ufunc 'correct_alternative_cosine' did not contain a loop with signature matching types…
0
votes
0 answers

scikit-learn HDBscan throws error when trying to compute medoids/centroids

I have a precomputed distance matrix that I want to find the medoids for. According to the scikit-learn docs, there's a parameter and attribute that you have to set and call in order to retrieve these medoids. When I set the parameter…
0
votes
0 answers

Top2Vec model returning TypeError: 'numpy.float64' object cannot be interpreted as an integer

I'm trying to train a top2vec model and run into the issue of not having enough documents, which I rectify by concatenating the dataframe with itself etc. Then, upon training the model, the TypeError comes up. I can't find where the…
Magnetar
  • 85
  • 8
0
votes
1 answer

HDBSCAN doesn't work anymore - 'float' object cannot be interpreted as an integer

I've been running HDBSCAN for weeks now on gene expression datasets and everything went perfectly well, but lately it refuses to run: clusterer = hdbscan.HDBSCAN(min_cluster_size=10, min_samples=1).fit(df) TypeError: 'float' object cannot…
Nozelar
  • 1
  • 1
0
votes
0 answers

HDBSCAN clusters sentence embeddings in one cluster that are way too far apart

I have the task to cluster utterances to a chatbot based on sentence similarity in order to find out which are topics users ask about and how important those topics are. I am converting the utterances into sentence embeddings using the…