Questions tagged [hdbscan]

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away). DBSCAN is one of the most commonly used clustering algorithms and among the most cited in the scientific literature.

In 2014, the algorithm was awarded the test of time award (an award given to algorithms which have received substantial attention in theory and practice) at the leading data mining conference, KDD.
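
The grouping rule in the description above (dense neighborhoods form clusters, isolated points become noise) can be sketched in a few lines. This is only a minimal illustration, assuming Euclidean distance and a brute-force neighbor search; the names `dbscan`, `eps`, `min_pts` and `NOISE` are chosen here for the example and are not taken from any library, and `math.dist` requires Python 3.8 or later.

```python
from math import dist

NOISE = -1  # label used here for points that belong to no cluster


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # Brute-force range query; the point itself counts as a neighbor.
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            # Not a core point; it may still become a border point later.
            labels[i] = NOISE
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable, but not core
                continue
            if labels[j] is not None:
                continue  # already assigned to a cluster
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels


# Two tight squares of points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (5, 5)]
print(dbscan(points, eps=1.5, min_pts=3))
# prints [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The brute-force query makes this quadratic in the number of points, so it is suitable only for small inputs; practical implementations typically use a spatial index for the neighbor search.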

81 questions
0
votes
2 answers

HDBSCAN: ValueError: numpy.ndarray size changed, may indicate binary incompatibility. Expected 88 from C header, got 80 from PyObject

I am trying to initialize HDBSCAN for clustering in JupyterLab. I use Python 3.7.6. import numpy as np import pandas as pd from sklearn.datasets import load_digits from sklearn.manifold import TSNE import hdbscan There always appears the…
0
votes
0 answers

Clustering similar lines with HDBSCAN

The image above is a frame from a video. The ultimate goal is to detect the gate. What I want to do is cluster lines similarly to the circles, where the lines that are not circled are outliers. My findings tell me this is an HDBSCAN problem so I…
Luka Jozić
  • 162
  • 2
  • 12
0
votes
1 answer

Can we refit, or fit in parts, clustering algorithms?

I want to cluster a big data set (more than 1M records). I want to use the dbscan or hdbscan algorithms for this clustering task. When I try to use one of those algorithms, I get a memory error. Is there a way to fit a big data set in parts? (go…
0
votes
1 answer

hdbscan error when inside rapids container

I am using rapids UMAP in conjunction with HDBSCAN inside a rapidsai docker container : rapidsai/rapidsai-core:0.18-cuda11.0-runtime-ubuntu18.04-py3.7 import cudf import cupy from cuml.manifold import UMAP import hdbscan from sklearn.datasets…
Igna
  • 1,078
  • 8
  • 18
0
votes
1 answer

HDBSCAN Shouldn't any object in a cluster have a probability value > 0? And producing inconsistent results

I am using hdbscan to find clusters within a dataset in a Python Jupyter notebook. import pandas as pandas import numpy as np data = pandas.read_csv('data.csv') That data looks something like this: import hdbscan clusterSize = 6 clusterer =…
Glen Pierce
  • 4,401
  • 5
  • 31
  • 50
0
votes
1 answer

How can I cluster 5 dimensional data using HDBSCAN

I am trying to cluster the NTU-RGB+D 120 skeleton dataset using HDBSCAN. The numpy array of the skeleton data has 5 dimensions **dataset.shape=[40091, 3, 300, 25, 2]** where No. of data = 40091, Coordinates = 3 (x-y-z), No. of frames = 300, No. of joints =…
Lp81194
  • 79
  • 1
  • 1
  • 10
0
votes
2 answers

HDBSCAN cluster caching and persistence

HDBSCAN has a flag to cache its cluster data, passed as a param as mentioned below: prediction_data : boolean, optional Whether to generate extra cached data for predicting labels or membership vectors few new unseen points later. If you wish to persist…
Shan
  • 71
  • 1
  • 10
0
votes
1 answer

The same results in DBSCAN and HDBSCAN?

DBSCAN(epsilon, minPts = 2) is related to single linkage clustering and HDBSCAN(minPts = 2) is also related to single linkage clustering. My question is: how can I obtain the same clustering results with these settings? Or need to set other…
run2you
  • 3
  • 2
0
votes
1 answer

Not able to predict the cluster membership of a new point under hdbscan function available under "dbscan" package

I am using the hdbscan function from the package called "dbscan" to perform clustering on my data. I am not able to predict the membership of a new data point after the cluster is built. The predict function works for the object built under dbscan…
0
votes
1 answer

Using callable metric for HDBSCAN*

I want to cluster some data with HDBSCAN*. The distance is calculated as a function of some parameters from both values so if the data look like: label1 | label2 | label3 0 32 18.5 3 1 34.5 11 12 2 .. .. …
Roy Ancri
  • 119
  • 2
  • 14
0
votes
1 answer

Retrieving members of a cluster with HDBSCAN

So I have some string data that I do some manipulations to and then create a cluster with using HDBSCAN: textData = train['eudexHash'].apply(lambda x: str(x)) clusterer = hdbscan.HDBSCAN(min_cluster_size=5, …
0
votes
0 answers

Bizarre HDBScan clustering result for cosine-similarity matrix

I'm trying to cluster similar messages within machine log files (where e.g. I can't ignore numbers). Debugging my code with a subset of messages which all have the same "degree of similarity" I came across a very strange finding: below a certain…
MarkH
  • 122
  • 9
0
votes
1 answer

How to reconstruct an image after clustering with hdbscan?

I am trying to reconstruct a brain tumor image after clustering using hdbscan. However, hdbscan does not have cluster centers unlike kmeans so I am a bit confused on how to obtain the clustered image. I have tried obtaining the ref cluster center…
0
votes
2 answers

How to visualise top terms on each HDBSCAN cluster

I'm currently trying to use HDBSCAN to cluster a bunch of movie data, in order to group similar content together and be able to come up with 'topics' that describe those clusters. I'm interested in HDBSCAN because I'm aware that it's considered soft…
J.Doe
  • 529
  • 4
  • 14
0
votes
2 answers

How to print output results in HDBSCAN

I have ASCII data and I need to cluster the data using HDBSCAN. I got the labels but I don't know how to print the output cluster results, i.e. unique and segregated results from hdbscan. snippet: import hdbscan import numpy as np datafile =…
vasu
  • 1