
Is there a way to limit the number of cores used by the default threaded scheduler (default when using dask dataframes)?

With compute, you can specify it by using:

df.compute(scheduler="threads", num_workers=20)

But I was wondering if there is a way to set this as the default, so you don't need to specify this for each compute call?

This would be useful, for example, on a small shared cluster (say 64 cores) with no job scheduler, where I don't want dask to take up all the cores whenever I start a computation.

joris

1 Answer


You can specify a default ThreadPool:

from multiprocessing.pool import ThreadPool
import dask
dask.config.set(pool=ThreadPool(20))
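If you only want the limit to apply to a particular block of computations rather than globally, `dask.config.set` can also be used as a context manager. A minimal sketch (pool size of 4 and the toy task graph are arbitrary choices for illustration):

```python
from multiprocessing.pool import ThreadPool

import dask
import dask.threaded

# Limit the threaded scheduler to a 4-thread pool, but only
# within this block; outside it, the default pool is restored.
with dask.config.set(pool=ThreadPool(4)):
    # A tiny hand-built task graph, executed on the configured pool
    dsk = {"a": 1, "b": 2, "c": (lambda x, y: x + y, "a", "b")}
    result = dask.threaded.get(dsk, "c")

print(result)  # 3
```

The same context-manager form works with any real dask collection's `.compute()` call, since the threaded scheduler picks up the `pool` entry from the active configuration.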
MRocklin
  • 55,641
  • 23
  • 163
  • 235
  • 2
    ThreadPool(20) sets the number of processes to 20. Is there a way to restrict the number of threads per process? Handling many threads in a single process can produce unnecessary overhead. – Andy R Aug 22 '19 at 20:42
  • @AndiR That's an incorrect assumption. `multiprocessing.pool.ThreadPool` is a pool of *threads*, not the same as `multiprocessing.Pool`. See https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.ThreadPool – Sebastian Hoffmann Mar 22 '21 at 16:48
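The point made in the last comment can be verified directly: every worker in a `ThreadPool` runs inside the same process, so they all report the same PID. A quick check (pool size and task count are arbitrary):

```python
import os
from multiprocessing.pool import ThreadPool

pool = ThreadPool(4)

# If these were separate processes, we would see several distinct PIDs;
# since ThreadPool workers are threads, they all share one.
pids = set(pool.map(lambda _: os.getpid(), range(8)))
print(len(pids))  # 1

pool.close()
pool.join()
```

Note that lambdas work here because `ThreadPool` dispatches work within one process and never needs to pickle the function, unlike `multiprocessing.Pool`.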