
We have a large project that comprises numerous tasks. We use a dask graph to schedule each task; a small sample of the graph is shown below. Note that dask is set to multiprocessing mode.

dask_graph:

  universe: !!python/tuple [gcsstrategies.svc.business_service.UniverseService.load_universe_object, CONTEXT]
  raw_market_data: !!python/tuple [gcsstrategies.svc.data_loading_service.RDWLoader.load_market_data, CONTEXT, universe]
  raw_fundamental_data: !!python/tuple [gcsstrategies.svc.data_loading_service.RDWLoader.load_fundamental_data, CONTEXT, universe]

dask_keys: [raw_fundamental_data]
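For reference, when this configuration runs it becomes a plain dask task graph handed to the multiprocessing scheduler. A minimal sketch of the same shape, using placeholder functions inc and add instead of our real loaders:

from dask.multiprocessing import get

def inc(x):      # placeholder; stands in for a loader such as load_universe_object
    return x + 1

def add(x, y):   # placeholder; stands in for a loader such as load_market_data
    return x + y

if __name__ == '__main__':
    # Each graph value is a tuple of (callable, *args); argument strings
    # that match other keys are replaced by those tasks' results.
    dsk = {
        'universe': (inc, 1),
        'raw_market_data': (add, 'universe', 10),
    }
    print(get(dsk, 'raw_market_data'))  # inc(1) -> 2; add(2, 10) -> 12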

Now one of the tasks, raw_fundamental_data, lazily schedules dask tasks using @delayed and runs them with dask.compute(). The reason for this design choice is that the list of tasks scheduled and lazily run by dask within raw_fundamental_data is chosen dynamically at runtime based on runtime parameters.
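The pattern inside raw_fundamental_data looks roughly like this (a minimal sketch; select_loaders and the loader callables are hypothetical stand-ins for the runtime-chosen tasks):

import dask
from dask import delayed

def select_loaders(context):
    # Hypothetical helper: in the real system this inspects runtime
    # parameters and returns the list of loader callables to schedule.
    return [lambda u: ('balance_sheet', u), lambda u: ('income_statement', u)]

def load_fundamental_data(context, universe):
    loaders = select_loaders(context)
    tasks = [delayed(loader)(universe) for loader in loaders]
    # This nested dask.compute() is what fails under the multiprocessing
    # scheduler with "daemonic processes are not allowed to have children".
    return dask.compute(*tasks)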

The error we see is:

daemonic processes are not allowed to have children

We understand this is because a spawned process is trying to spawn children of its own. Is there any solution to this problem? Does dask have any way to allow a task scheduled via the dask graph to schedule and lazily run its own tasks, either using @delayed or another method?

Please note that in our system there are numerous tasks that will run their own tasks using multiprocessing, so sequential execution is not an option.


1 Answer


The multiprocessing scheduler is not capable of this kind of operation. However, the distributed scheduler is (and you can easily use the distributed scheduler on a single machine).

The relevant doc page is here: http://distributed.readthedocs.io/en/latest/task-launch.html

Here is a small example:

In [1]: from dask.distributed import Client, local_client

In [2]: def f(n):
   ...:     # Open a client to the same cluster from inside the task;
   ...:     # local_client secedes from the worker's thread pool while waiting
   ...:     with local_client() as lc:
   ...:         futures = [lc.submit(lambda x: x + 1, i) for i in range(n)]
   ...:         total = lc.submit(sum, futures)
   ...:         return total.result()
   ...:

In [3]: c = Client()  # start processes on local machine

In [4]: future = c.submit(f, 10)

In [5]: future.result()
Out[5]: 55

This uses the concurrent.futures interface to dask rather than dask.delayed, but you can use dask.delayed just as well. See http://distributed.readthedocs.io/en/latest/manage-computation.html
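For instance, a rough dask.delayed equivalent of the example above (a sketch; note that local_client was later renamed worker_client, as the comments below mention):

import dask
from dask import delayed
from dask.distributed import Client, local_client

def f(n):
    with local_client() as lc:
        tasks = [delayed(lambda x: x + 1)(i) for i in range(n)]
        total = delayed(sum)(tasks)
        # Hand the delayed graph to the worker's client for execution
        return lc.compute(total).result()

if __name__ == '__main__':
    c = Client()                      # start processes on the local machine
    print(c.submit(f, 10).result())   # 55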

MRocklin
  • Thank you very much for your help. So clean and simple, and your solution worked perfectly. I was not able to get @delayed working, though, so I opted for your method of using local_client(). – Samaneh Navabpour Feb 03 '17 at 14:03
    @MRocklin Is it a valid use-case if I connect to the same Dask Distributed cluster from inside the delayed/submitted function? (i.e. `lc = Client('127.0.0.1:8786')`) I seem to get a deadlock when I try to do so and use "published" datasets; should I report it? – Vlad Frolov Mar 21 '17 at 20:16
    See [distributed.worker_client](http://distributed.readthedocs.io/en/latest/task-launch.html#submit-tasks-from-worker) – MRocklin Mar 21 '17 at 21:11
  • @MRocklin Perfect! Every time I stuck somewhere in Dask, I just need to re-read the documentation as Dask evolves blazingly fast heading ahead of my thoughts :) – Vlad Frolov Mar 22 '17 at 08:32
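For reference, the worker_client pattern mentioned in the comments looks like this (a sketch; worker_client is the later name for local_client):

from dask.distributed import Client, worker_client

def f(n):
    # worker_client secedes from the worker's thread pool while it waits,
    # so tasks submitted from inside a task cannot deadlock the worker.
    with worker_client() as wc:
        futures = [wc.submit(lambda x: x + 1, i) for i in range(n)]
        return wc.submit(sum, futures).result()

if __name__ == '__main__':
    c = Client()
    print(c.submit(f, 10).result())  # 55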