HDBSCAN cluster caching and persistence

Question

HDBSCAN has a parameter for caching its cluster data, documented as follows:

prediction_data : boolean, optional

Whether to generate extra cached data for predicting labels or membership vectors for new unseen points later. If you wish to persist the clustering object for later re-use you probably want to set this to True. (default False)

Now I see that the following folder structure is created at the specified location:

>joblib
...>hdbscan
......>hdbscan_
.........>_hdbscan_boruvka_balltree
............>f1bd5f351764560c3532dbe30f273481
...............metadata.json
...............output.pkl
............func_code.py

As the HDBSCAN docs suggest, these files (probably the pickle file) should work as a persistence store that can later be re-used to find cluster labels for new data points. But I can't find a way to do it.

score 1 · Answer 1 · answered Sep 07 '21 at 05:09

I got here when I was searching for caching memory in HDBSCAN. My original search led me to https://joblib.readthedocs.io/en/latest/auto_examples/memory_basic_usage.html where I found the code below:

from joblib import Memory
location = './cachedir'
memory = Memory(location, verbose=0)

but on using it, I got a

DeprecationWarning: The 'cachedir' parameter has been deprecated in version 0.12 and will 
be removed in version 0.14. 
You provided "cachedir='/tmp/joblib'", use "location='/tmp/joblib'" instead.

This led to the updated code for caching memory in HDBSCAN using joblib:

from joblib import Memory
location='/tmp/joblib'
memory = Memory(location, verbose=0)

score 0 · Answer 2 · answered Aug 30 '20 at 21:14

The parameter you want to look at is memory=. If you call HDBSCAN a second time with the same memory= parameter and change only (say) min_cluster_size, while explicitly holding min_samples fixed between runs, it will save you recomputation time.

Leland McInnes

1. How do we set the path we want for memory? The documentation says little about it. For example, if I want to save it in a directory named ```results``` next to my script and I set ```memory = 'results'```, where is it going to be stored? 2. How do we load the model? There is no info in the documentation; do we use joblib's load() function? – tonythestark Nov 20 '22 at 18:29
