Questions tagged [hdbscan]

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering algorithm: given a set of points in some space, it groups together points that are closely packed (points with many nearby neighbors) and marks as outliers the points that lie alone in low-density regions (whose nearest neighbors are too far away). DBSCAN is one of the most commonly used clustering algorithms and among the most cited in the scientific literature.

In 2014, the algorithm received the Test of Time Award (an award given to algorithms that have received substantial attention in theory and practice) at KDD, the leading data mining conference.
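As a rough illustration of the density rule in the tag description above, here is a minimal pure-Python sketch of DBSCAN. It is not the reference implementation and not the hdbscan library: the function name `dbscan`, the parameter names `eps` and `min_pts`, and the brute-force neighbor search are assumptions made for this example.

```python
from math import dist  # Python 3.8+

NOISE = -1


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1)."""
    labels = [None] * len(points)  # None means "not visited yet"

    def neighbors(i):
        # Brute-force region query; the result includes the point itself.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: reachable, but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point, so keep expanding
    return labels


# Two dense 2x2 squares and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))
# [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point ends up with the label -1, which matches the "outlier" wording of the description. The neighbor search here is quadratic overall, so this is only meant for reading, not for large data sets.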

81 questions
2
votes
1 answer

How to install the HDBSCAN module, Python 3.7, Windows 10

I need to use the HDBSCAN algorithm on my data, but the module is not installed. I use Python 3.7. I am not very familiar with this kind of tricky installation. Please, can anyone give me clear and understandable instructions on how to install…
Artashes
  • 102
  • 1
  • 9
2
votes
0 answers

HDBSCAN approximate_predict always returning probability of 0

I am using HDBSCAN to generate prediction data for a given cluster model. I then attempt to classify new points using the approximate_predict function to find the correct cluster for a new point. The model returns the correct cluster for a new point…
James
  • 459
  • 1
  • 5
  • 16
2
votes
0 answers

Reduce spatial data set size using HDBSCAN

I am trying to reduce the size of a spatial data set by clustering the points and finding the center point of each cluster. I referred to this article (which uses DBSCAN) and it kind of helped, except that now the data set size has increased, I am now unable…
M_S_N
  • 2,764
  • 1
  • 17
  • 38
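For the reduction step this question describes (one center point per cluster), a minimal sketch might look like the following. It assumes the cluster labels have already been computed by some clusterer and that noise is labeled -1, as DBSCAN and HDBSCAN implementations commonly do. The function name `cluster_centers` and the sample coordinates are made up for illustration.

```python
from collections import defaultdict


def cluster_centers(points, labels, noise_label=-1):
    """Return {cluster_label: (mean_x, mean_y)}, skipping noise points."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])  # running x sum, y sum, count
    for (x, y), label in zip(points, labels):
        if label == noise_label:
            continue  # noise points do not belong to any cluster
        acc = sums[label]
        acc[0] += x
        acc[1] += y
        acc[2] += 1
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}


# Hypothetical coordinates and labels, for illustration only.
points = [(0.0, 0.0), (2.0, 2.0), (10.0, 10.0), (5.0, 5.0)]
labels = [0, 0, 1, -1]
print(cluster_centers(points, labels))
# {0: (1.0, 1.0), 1: (10.0, 10.0)}
```

A plain arithmetic mean of latitude/longitude pairs is only a reasonable center for clusters that cover a small area and do not straddle the antimeridian. The output has one row per cluster, so it can only be smaller than the input when noise points are dropped rather than kept as their own rows.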
2
votes
4 answers

ERROR: You must give at least one requirement to install -- when running: pip install --upgrade --no-binary hdbscan

I am trying to install hdbscan on my PC, which runs Windows 10 and has Python 3.6 installed. My first attempt failed: (base) C:\WINDOWS\system32>pip install hdbscan --user Collecting hdbscan Using cached…
user8270077
  • 4,621
  • 17
  • 75
  • 140
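The error in this question's title is consistent with how pip parses `--no-binary`: the option takes a value (a package name, or `:all:` / `:none:`), so in `pip install --upgrade --no-binary hdbscan` the word `hdbscan` is consumed as that value and no requirement is left to install. The sketch below only builds a command line in which the package name appears twice, once as the option value and once as the requirement. The helper name `pip_source_install_command` is hypothetical, and the command is printed rather than executed.

```python
import sys


def pip_source_install_command(package):
    """Build a pip command that installs `package` from source.

    The name is passed twice: first as the value of --no-binary,
    then as the actual requirement to install.
    """
    return [
        sys.executable, "-m", "pip", "install", "--upgrade",
        "--no-binary", package,  # value consumed by --no-binary
        package,                 # the requirement itself
    ]


print(" ".join(pip_source_install_command("hdbscan")))
```

Whether a source build then succeeds on Windows is a separate matter, since it needs a working C compiler toolchain; this sketch only addresses the argument-parsing error.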
2
votes
3 answers

How to evaluate HDBSCAN text clusters?

I'm currently trying to use HDBSCAN to cluster movie data. The goal is to cluster similar movies together (based on movie info like keywords, genres, actor names, etc.) and then apply LDA to each cluster to get the representative topics. However,…
J.Doe
  • 529
  • 4
  • 14
2
votes
0 answers

Difference Between OPTICS and HDBSCAN clustering techniques

As part of my assignment, I have to work with both the HDBSCAN and OPTICS clustering techniques. I have searched many sites to identify the difference between these algorithms. All I found was that the OPTICS algorithm is a slight variation of HDBSCAN. I…
Minu
  • 33
  • 4
2
votes
1 answer

HDBSCAN won't utilize all available cpus. Processes just sleep

For the past few weeks I've been attempting to perform a fairly large clustering analysis using the HDBSCAN algorithm in Python 3.7. The data in question is roughly 4 million rows by 40 columns, at around 1.5 GB in CSV format. It's a mixture of ints,…
1
vote
0 answers

ValueError: Buffer dtype mismatch, expected 'double_t' but got 'float' - hdbscan validity_index

I'm using the validity index in the hdbscan package, which implements DBCV score according to the following paper: https://www.dbs.ifi.lmu.de/~zimek/publications/SDM2014/DBCV.pdf I'm working on a face clustering project, and after using the validity…
1
vote
0 answers

Creating clusters from 3D data through HDBSCAN

I have a problem: I have a big data set of 15,000 points that represent airplanes over Europe, with latitudes, longitudes and altitudes. I am trying to create a program that will take the points from a specific country and then create…
Martin Kavka
  • 49
  • 1
  • 3
1
vote
1 answer

HDBSCAN: clustering, persistence and approximate_predict()

I want to cache my model results in order to make predictions without redoing the clustering. I read that I can do that with the memory parameter in HDBSCAN. I did that instead because I wanted to save the file in the same directory as my script instead…
tonythestark
  • 519
  • 4
  • 15
1
vote
0 answers

Serving "Frankenstein" (combined) models at scale

I have a TensorFlow model that's combined with a clustering algorithm (HDBSCAN). Both have been trained/fitted separately, but they work together (tf -> hdbscan). I'm looking to serve predictions on GCP at scale. Currently, I've created a custom…
bli00
  • 2,215
  • 2
  • 19
  • 46
1
vote
0 answers

HDBSCAN on Movielens Latent embeddings does not cluster well

I am working on a recommendation algorithm, and that has right now boiled down to finding the right clustering algorithm for the job. Data: The data I'm working with is the MovieLens 100K dataset, from which I've extracted movie titles, genres and…
1
vote
1 answer

Plot a single cluster

I am working with HDBSCAN and I want to plot only one cluster of the data. This is my current code: import hdbscan import pandas as pd from sklearn.datasets import make_blobs blobs, labels = make_blobs(n_samples=2000, n_features=10) clusterer =…
Cruz
  • 133
  • 12
1
vote
1 answer

How to properly cluster with HDBSCAN for 1D dataset?

My dataset below shows product sales per price (link to download dataset csv):

     price   quantity
0    5098.0  20
1    5098.5  40
2    5099.0  10
3    5100.0  90
4    5100.5  20
..   ...     ...
290  5247.0  …
1
vote
1 answer

Explain Behavior of HDBSCAN Clustering

I have a dataset of 6 elements. I computed the distance matrix using Gower distance, which resulted in the following matrix: By just looking at this matrix, I can tell that element #0 is similar to element #4 and #5 the most, so I assumed the…