For the past few weeks I've been attempting to perform a fairly large clustering analysis using the HDBSCAN algorithm in Python 3.7. The data in question is roughly 4 million rows by 40 columns, around 1.5 GB in CSV format, and is a mixture of ints, bools, and floats of up to 9 digits.
During this period, each time I've managed to get the data to cluster it has taken three-plus days, which seems odd given that HDBSCAN is revered for its speed and that I'm running this on a Google Cloud compute instance with 96 CPUs. I've spent days trying to get it to use the instance's full processing power, to no avail.
With HDBSCAN's automatic algorithm detection, it selects boruvka_kdtree as the best algorithm to use. I've tried passing all sorts of values to the core_dist_n_jobs parameter: -2, -1, 1, 96, multiprocessing.cpu_count(), and 300. They all seem to have a similar effect: four main Python processes each use a full core, while many more sleeping processes get spawned.
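Roughly speaking, the kind of sweep I've been running looks like the sketch below. It runs on a random subsample so each iteration finishes quickly; the CSV path, subsample size, and scaled-down min_cluster_size are placeholders rather than my real values.

import time
import multiprocessing

import hdbscan
import pandas as pd

df = pd.read_csv("data.csv")  # placeholder path to the ~4M x 40 CSV
X = df.sample(n=200_000, random_state=0).to_numpy()  # subsample so each run finishes quickly

for n_jobs in (-2, -1, 1, 96, multiprocessing.cpu_count(), 300):
    # min_cluster_size scaled down to roughly match the smaller subsample
    clusterer = hdbscan.HDBSCAN(min_cluster_size=1000, core_dist_n_jobs=n_jobs)
    start = time.time()
    clusterer.fit(X)
    print(n_jobs, round(time.time() - start, 1), "seconds")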
I refuse to believe I'm doing this right and that this is truly how long it takes on this hardware. I'm convinced I must be missing something, like running JupyterHub on the same machine causing some sort of GIL lock, or some HDBSCAN parameter I should be setting.
Here is my current call to HDBSCAN:
hdbscan.HDBSCAN(min_cluster_size=100000, min_samples=500, algorithm='best', alpha=1.0, memory=mem, core_dist_n_jobs=multiprocessing.cpu_count())
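For context, a self-contained version of the surrounding setup, run as a plain script outside JupyterHub, looks roughly like the sketch below. The CSV path and cache directory are placeholders, and mem is shown here as a joblib.Memory cache for illustration (HDBSCAN's memory argument also accepts a plain path string).

import multiprocessing

import hdbscan
import pandas as pd
from joblib import Memory

mem = Memory(location="./hdbscan_cache", verbose=0)  # shown as a joblib.Memory cache; placeholder directory
df = pd.read_csv("data.csv")  # placeholder path: ~4M rows x 40 columns of ints, bools, and floats
X = df.to_numpy()

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=100000,
    min_samples=500,
    algorithm='best',
    alpha=1.0,
    memory=mem,                                    # caches the expensive tree computations between runs
    core_dist_n_jobs=multiprocessing.cpu_count(),  # parallelism for the core distance computation
)
clusterer.fit(X)
print(clusterer.labels_.max() + 1, "clusters found")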
I've followed every related issue and post I could find and nothing has worked so far, but I'm always willing to try even radical ideas, because this isn't even the full data I want to cluster, and at this rate it would take four years to cluster the full dataset!