
I have the following script (optics.py) to estimate clustering with precomputed distances:

from sklearn.cluster import OPTICS
import numpy as np

# load the precomputed pairwise distance matrix
distances = np.load(r'distances.npy')

# cluster on the precomputed distances, requesting all CPU cores
clust = OPTICS(metric='precomputed', n_jobs=-1)
clust = clust.fit(distances)

Looking at the htop output, I can see that only one CPU core is used

[htop screenshot: only one core is busy]

despite the fact that scikit-learn runs the clustering in multiple processes:

[htop screenshot: multiple scikit-learn worker processes]

Why has `n_jobs=-1` not resulted in using all the CPU cores?

  • you can check that: https://joblib.readthedocs.io/en/latest/parallel.html#joblib.parallel_backend – PV8 Jan 16 '20 at 08:48
  • @PV8 How is joblib context related here? The [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.compute_optics_graph.html) for the `n_jobs` parameter in OPTICS says: *"The number of parallel jobs to run for neighbors search. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors."* I'm not using any `joblib.parallel_backend` context, so I would expect -1 to use all CPU cores, unless there's a bug or some constraint. – dzieciou Jan 16 '20 at 08:54
  • 2
    later in that text it is also mentioned that they are still working on it and there is a github link to list all the bugs, I would assume that there is a bug – PV8 Jan 16 '20 at 08:55
  • 1
    [HDBSCAN](https://hdbscan.readthedocs.io/en/latest/index.html) might be a decent alternative in the meantime. Both approaches trying to solve the same problem (DBSCAN with variable cluster densities) and -- at least in my hands -- often give very similar results. The HDBSCAN implementation currently uses [up to 4 cores](https://github.com/scikit-learn-contrib/hdbscan/issues/160). – Paul Brodersen Jan 16 '20 at 12:22

3 Answers


I'm the primary author of the sklearn OPTICS module. Parallelism is difficult because there is an ordering loop that cannot be run in parallel; that said, the most computationally intensive task is the distance calculations, and those can be run in parallel. More specifically, sklearn OPTICS calculates the upper-triangle distance matrix one row at a time, starting with 'n' distance lookups and decreasing to 'n-1', 'n-2' lookups, for a total of n-squared / 2 distance calculations.

The problem is that parallelism in sklearn is generally handled by joblib, which uses processes (not threads), and processes have rather high creation and destruction overhead when used inside a loop. That is, you would create and destroy the process workers for every row as you loop through the data set, and 'n' setup/teardowns of processes costs more than the parallelism benefit you get from joblib -- this is why n_jobs is disabled for OPTICS.
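
To make the trade-off concrete, here is a rough, purely illustrative sketch of the row-by-row pattern described above (the names and structure are mine, not the actual sklearn source); wrapping the per-row work in a joblib call would pay the worker setup/teardown cost n times:

import numpy as np

def optics_like_distance_pass(X):
    # Illustrative only: mimics the shrinking per-row distance lookups,
    # not the real OPTICS implementation.
    n = len(X)
    rows = []
    for i in range(n):  # ordering loop: inherently sequential
        # row i needs distances to the points not yet processed:
        # n, then n-1, n-2, ... lookups, roughly n**2 / 2 in total
        row = np.linalg.norm(X[i] - X[i:], axis=1)
        rows.append(row)
        # a joblib.Parallel(...) call placed here would create and destroy
        # process workers on every iteration, and that overhead outweighs
        # the gain from parallelising a single row
    return rows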

The best way to 'force' parallelism in OPTICS is probably to define a custom distance metric that runs in parallel-- see this post for a good example of this:

https://medium.com/aspectum/acceleration-for-the-nearest-neighbor-search-on-earths-surface-using-python-513fc75984aa

One of the examples in the post above actually pushes the distance calculation onto a GPU, but still uses sklearn for the algorithm execution.
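
If the full n x n matrix fits in memory, a simpler variant of the same idea is to parallelise only the distance computation yourself and then run OPTICS sequentially on the precomputed matrix. A minimal sketch (the file name and metric are placeholders for your own data):

import numpy as np
from sklearn.cluster import OPTICS
from sklearn.metrics import pairwise_distances

X = np.load('points.npy')  # hypothetical raw feature matrix

# the distance computation runs on all cores ...
D = pairwise_distances(X, metric='euclidean', n_jobs=-1)

# ... while the sequential ordering loop works on the precomputed matrix
clust = OPTICS(metric='precomputed').fit(D)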


I also face this problem. According to some papers (for example this, see the abstract), OPTICS is known to be challenging to parallelize because of its sequential nature. So sklearn probably tries to use all cores when you pass n_jobs=-1, but there is nothing to run on the extra cores.

You should probably consider other, more parallelism-friendly clustering algorithms; for example, @paul-brodersen suggests HDBSCAN in the comments. But it seems that sklearn does not have such a parallel alternative to OPTICS, so you need to use other packages.
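
For illustration, a minimal sketch of that route with the hdbscan package (assuming it is installed; it accepts the same precomputed matrix as the script in the question):

import numpy as np
import hdbscan

distances = np.load('distances.npy')

# drop-in replacement for the OPTICS call, reusing the precomputed matrix
clusterer = hdbscan.HDBSCAN(metric='precomputed')
labels = clusterer.fit_predict(distances)

# if you fit on raw features instead, core_dist_n_jobs controls how many
# cores the core-distance computation may use:
# clusterer = hdbscan.HDBSCAN(metric='euclidean', core_dist_n_jobs=-1)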


Both OPTICS and HDBSCAN suffer from a lack of parallelization. They are both sequential in nature and thus can't be handed off to a simple joblib.Parallel the way DBSCAN can.

If you're looking to improve speed, one of the benefits of HDBSCAN is the ability to create an inference model that you can use to make predictions without having to re-run the whole clustering. That's what I use to avoid running a very slow clustering operation every time I need to classify my data.
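
A rough sketch of that workflow with the hdbscan package (X_train and X_new are placeholder arrays; as far as I know, approximate_predict requires fitting with prediction_data=True):

import hdbscan

# fit the slow clustering once, keeping the extra data needed for prediction
clusterer = hdbscan.HDBSCAN(prediction_data=True)
clusterer.fit(X_train)  # X_train: placeholder feature matrix

# later: classify new points without re-running the whole clustering
labels, strengths = hdbscan.approximate_predict(clusterer, X_new)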
