I'm trying to run a large volume of data through scikit-learn's DBSCAN and would love to use all of the cores available on the machine to speed up the computation. I'm using a custom distance metric, and the distance matrix is not precomputed.
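For reference, the metric is an ordinary Python function with the signature DBSCAN expects for a callable metric: two 1-D sample arrays in, a float out. The body below is only a simplified stand-in for my real spherical-distance-plus-time calculation, not the actual function:
import numpy as np

def distance_sphere_and_time(a, b):
    # Simplified stand-in: each sample is assumed to be (lat, lon, time).
    # The real metric combines a great-circle distance with a time difference;
    # the weighting here is purely illustrative.
    lat1, lon1, lat2, lon2 = np.radians([a[0], a[1], b[0], b[1]])
    h = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    dist_km = 2 * 6371.0 * np.arcsin(np.sqrt(h))
    return dist_km + abs(a[2] - b[2])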
I have tried several implementations without much success. I've listed them below, along with what I observed while tracking CPU usage with top in a terminal window.
- Using the built-in n_jobs argument:
from sklearn.cluster import DBSCAN
model = DBSCAN(eps=eps, min_samples=min_samps, metric=distance_sphere_and_time, n_jobs=-1)
model.fit(X)
CPU usage only hits about 2%; it looks like only one of the 48 available cores is doing the computation.
- Using the built-in n_jobs argument with algorithm='brute':
model = DBSCAN(eps=eps, min_samples=min_samps, metric=distance_sphere_and_time, algorithm='brute', n_jobs=-1)
model.fit(X)
This was suggested as the only way DBSCAN works with parallel processing here: https://github.com/scikit-learn/scikit-learn/pull/8039, although there was a warning that brute force could slow it down. CPU usage hit 100%, but it was not any faster.
- Using dask for the parallel processing:
from dask.distributed import Client
from joblib import parallel_backend
client = Client(processes=False, n_workers=8)
model = DBSCAN(eps=eps, min_samples=min_samps, metric=distance_sphere_and_time)
with parallel_backend('dask'):
    # run the fit inside the dask joblib backend
    model.fit(X)
This type of setup was suggested here: https://github.com/dask/dask-tutorial/issues/80. However, CPU utilization stays at about 2%, suggesting that only one core is being used.
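As a next step, I want to rule out a problem with the dask wiring itself, independent of DBSCAN, with a quick check along these lines (busy_task here is just a throwaway CPU-bound placeholder, and I use processes=True so the GIL doesn't hide the parallelism):
from dask.distributed import Client
from joblib import Parallel, delayed, parallel_backend

def busy_task(n):
    # CPU-bound placeholder; if the backend is wired up, several cores
    # should light up in top while this runs
    total = 0
    for i in range(n):
        total += i * i
    return total

client = Client(processes=True, n_workers=8)
with parallel_backend('dask'):
    results = Parallel(n_jobs=-1)(delayed(busy_task)(10_000_000) for _ in range(32))
If this saturates the workers but DBSCAN still pins a single core, the bottleneck is presumably inside the DBSCAN call rather than in the dask setup.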
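The only other route I can think of is to precompute the distance matrix in parallel (pairwise_distances takes an n_jobs argument, although I'm not sure how well it parallelizes a pure-Python callable either) and then cluster with metric='precomputed'. A rough sketch of what I mean, which I'd rather avoid because the full n x n matrix won't fit comfortably in memory for my data:
from sklearn.cluster import DBSCAN
from sklearn.metrics import pairwise_distances

# builds the full n x n distance matrix up front, which is memory-heavy for large n
D = pairwise_distances(X, metric=distance_sphere_and_time, n_jobs=-1)
model = DBSCAN(eps=eps, min_samples=min_samps, metric='precomputed')
model.fit(D)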
Any suggestions would be much appreciated.