
I'm trying to process a large volume of data through DBSCAN and would love to use all of the cores available on the machine to speed up the computation. I'm using a custom distance metric, and the distance matrix is not precomputed.

I have tried many different implementations, but have not had much success. I've listed them below, along with what I observed when tracking the performance with top in the terminal window.

  1. Using the built-in n_jobs parameter:

from sklearn.cluster import DBSCAN

model = DBSCAN(eps=eps, min_samples=min_samps, metric=distance_sphere_and_time, n_jobs=-1)
model.fit(X)

The CPU only hits 2% usage. It looks like only one core of a possible 48 is included in the computation.

  2. Using the built-in n_jobs parameter with algorithm='brute':

model = DBSCAN(eps=eps, min_samples=min_samps, metric=distance_sphere_and_time, algorithm='brute', n_jobs=-1)
model.fit(X)

This was suggested as the only way DBSCAN works with parallel processing here: https://github.com/scikit-learn/scikit-learn/pull/8039, although there was a warning that brute force could slow it down. The CPU usage hit 100%, but it was not any faster.

  3. Using dask for the parallel processing:

from dask.distributed import Client
from joblib import parallel_backend

client = Client(processes=False, n_workers=8)
model = DBSCAN(eps=eps, min_samples=min_samps, metric=distance_sphere_and_time)
with parallel_backend('dask'):
    model.fit(X)

This type of implementation was suggested here: https://github.com/dask/dask-tutorial/issues/80. However, the CPU utilization remains at 2%, suggesting that only one core is being used.

Any suggestions would be much appreciated.

Lauren K
  • As an update - I was able to increase the CPU utilization by removing the arguments to `Client` and adding `n_jobs=-1` to DBSCAN. I couldn't get it to use all of the cores available, but the CPU usage did increase to about 35%. `client = Client()` `model = DBSCAN(eps=eps, min_samples=min_samps, metric=distance_sphere_and_time, n_jobs=-1)` – Lauren K Nov 18 '19 at 14:27

2 Answers


Your problem is Python. Try the same in other tools such as ELKI (don't forget to add, e.g., a cover tree index) and you'll see a huge speed difference.

The reason is that your distance is a user-defined Python function. The ball tree that sklearn uses to search for neighbors is written in Cython, and for every distance computation it has to call back into the interpreter. It will even copy the point data for every single distance computation. These callbacks likely involve the infamous Python GIL, and hence ruin any parallelization effort.
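To see the effect of that callback overhead, here is a minimal sketch (the toy data, eps value, and py_euclidean callable below are assumptions for illustration, not the asker's metric) that times the same clustering with a built-in metric versus a Python callable:

import time
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.RandomState(0)
X_toy = rng.rand(2000, 3)  # assumed toy data, not the original data set

def py_euclidean(a, b):
    # Pure-Python callable: called once per pair of points, so every
    # distance computation crosses the Cython/interpreter boundary.
    return np.sqrt(np.sum((a - b) ** 2))

for metric in ('euclidean', py_euclidean):
    start = time.time()
    DBSCAN(eps=0.1, min_samples=5, metric=metric).fit(X_toy)
    print(metric, time.time() - start)

The second run spends its time going through the interpreter for every pair of points, which is exactly the overhead described above.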

Has QUIT--Anony-Mousse

Regarding the Dask attempt: processes=False tells Dask to use multithreading instead of multiprocessing, which isn't recommended for workloads that hold the GIL. (Ref docs: https://docs.dask.org/en/latest/scheduling.html)

So, for better performance across all cores, you can use:

client = Client(processes=True, n_workers=number_of_cpu_cores, threads_per_worker=1)

where n_workers equals the number of CPU cores you have, and threads_per_worker=1 tells Dask to use one process with one thread per worker. Also note that processes=True is the default configuration, so you needn't mention it explicitly.
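Putting that together with the code from the question, a sketch might look like this (eps, min_samps, distance_sphere_and_time, and X are the question's own names, and the 48-core count is an assumption taken from the question):

from dask.distributed import Client
from joblib import parallel_backend
from sklearn.cluster import DBSCAN

# One single-threaded worker process per core (48 cores assumed, per the question);
# processes=True is the default, so it is omitted here.
client = Client(n_workers=48, threads_per_worker=1)

model = DBSCAN(eps=eps, min_samples=min_samps, metric=distance_sphere_and_time, n_jobs=-1)
with parallel_backend('dask'):
    model.fit(X)

Whether this saturates all cores still depends on how much time is spent inside the Python-level distance function, as the other answer points out.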

pavithraes