I get an error when using distributed's LocalCluster in a subprocess under Python 3 (Python 2 works fine). Here is a minimal example (Python 3.6, distributed 1.23.3, tornado 5.1.1):

import multiprocessing

from distributed import LocalCluster
from distributed import Client



def call_client(cluster_address):
    with Client(cluster_address):
        pass


def main():
    cluster = LocalCluster(n_workers=2)
    print(cluster.workers)

    process = multiprocessing.Process(
        target=call_client, args=(cluster.scheduler.address, )
    )
    process.start()
    process.join()


if __name__ == "__main__":
    main()

When executing the file, I get the following error message:

user@9b97e84a3c58:/workspace$ python test.py
[<Nanny: tcp://127.0.0.1:35779, threads: 2>, <Nanny: tcp://127.0.0.1:40211, threads: 2>]
Process Process-3:
Traceback (most recent call last):
  File "/usr/lib/python3.6/multiprocessing/process.py", line 258, in _bootstrap
    self.run()
  File "/usr/lib/python3.6/multiprocessing/process.py", line 93, in run
    self._target(*self._args, **self._kwargs)
  File "test.py", line 10, in call_client
    with Client(cluster_address):
  File "/home/user/venv/lib/python3.6/site-packages/distributed/client.py", line 610, in __init__
    self.start(timeout=timeout)
  File "/home/user/venv/lib/python3.6/site-packages/distributed/client.py", line 733, in start
    sync(self.loop, self._start, **kwargs)
  File "/home/user/venv/lib/python3.6/site-packages/distributed/utils.py", line 277, in sync
    six.reraise(*error[0])
  File "/home/user/venv/lib/python3.6/site-packages/six.py", line 693, in reraise
    raise value
  File "/home/user/venv/lib/python3.6/site-packages/distributed/utils.py", line 262, in f
    result[0] = yield future
  File "/home/user/venv/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/home/user/venv/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/user/venv/lib/python3.6/site-packages/distributed/client.py", line 821, in _start
    yield self._ensure_connected(timeout=timeout)
  File "/home/user/venv/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/home/user/venv/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/home/user/venv/lib/python3.6/site-packages/distributed/client.py", line 862, in _ensure_connected
    self._update_scheduler_info())
  File "/home/user/venv/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
tornado.util.TimeoutError: Timeout
Joerg
  • The underlying problem seems to be that when `LocalCluster` is started in the main process and the subprocess is created with a fork context, the scheduler's port gets "duplicated" into the child and ends up in an inconsistent state. – Joerg Oct 26 '18 at 11:32
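
For context: the default multiprocessing start method is platform-dependent, which is why the fork behaviour described in the comment only shows up on some systems. A minimal sketch to check what your interpreter uses:

import multiprocessing

# 'fork' is the default on Linux; Windows always uses 'spawn',
# and macOS switched its default to 'spawn' in Python 3.8
print(multiprocessing.get_start_method())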

2 Answers


Using spawn seems to work. I suspect that there is some state that does not fork nicely.

process = multiprocessing.get_context('spawn').Process(...)
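
For completeness, a sketch of the question's example with the spawn context applied (only the construction of the Process changes):

import multiprocessing

from distributed import LocalCluster
from distributed import Client


def call_client(cluster_address):
    with Client(cluster_address):
        pass


def main():
    cluster = LocalCluster(n_workers=2)

    # 'spawn' starts a fresh interpreter for the child instead of forking,
    # so no event-loop or socket state is inherited from the parent
    ctx = multiprocessing.get_context('spawn')
    process = ctx.Process(
        target=call_client, args=(cluster.scheduler.address, )
    )
    process.start()
    process.join()


if __name__ == "__main__":
    main()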
MRocklin
  • Thanks MRocklin for the quick response. This solves the problem in the example above. My real problem is that I spawn the subprocess within a Flask app, and the subprocess uses `from flask import current_app`, which fails with `RuntimeError: Working outside of application context.` if I use 'spawn' instead of 'fork'. I will post a solution that works for me now, but it is quite involved. – Joerg Oct 26 '18 at 08:59
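
A minimal sketch of that failure mode, assuming a plain Flask app: a spawned child starts a fresh interpreter, so an application context pushed in the parent is not inherited, while a forked child receives a copy of it:

import multiprocessing

from flask import Flask, current_app

app = Flask(__name__)


def child():
    # under 'fork' this works because the pushed context is copied into
    # the child; under 'spawn' it raises
    # "RuntimeError: Working outside of application context."
    print(current_app.name)


def main():
    with app.app_context():
        ctx = multiprocessing.get_context('fork')  # change to 'spawn' to see the failure
        p = ctx.Process(target=child)
        p.start()
        p.join()


if __name__ == "__main__":
    main()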

Since my original problem is starting the subprocess within a Flask app, I can't use 'spawn' as suggested by MRocklin in the other answer. My current working solution is to not call `LocalCluster(n_workers=2)` in the main process, but to start the cluster in its own subprocess as well:

import sys
import multiprocessing
import signal
from functools import partial

from distributed import LocalCluster
from distributed import Client


def _stop_cluster(cluster, *args):
    # signal handlers are invoked with (signum, frame); absorbed here via *args
    cluster.close()
    sys.exit(0)


def _start_local_cluster(q, n_workers):
    cluster = LocalCluster(n_workers=n_workers)
    # report the scheduler address back to the parent process
    q.put(cluster.scheduler.address)

    # shut down cluster when process is terminated
    signal.signal(signal.SIGTERM, partial(_stop_cluster, cluster))
    # run forever
    signal.pause()


def call_client(cluster_address):
    with Client(cluster_address):
        print("I am working")


def main():
    q = multiprocessing.Queue()
    p_dask = multiprocessing.Process(target=_start_local_cluster, args=(q, 2))
    p_dask.start()
    cluster_address = q.get()  # blocks until the cluster process reports its address

    process = multiprocessing.Process(
        target=call_client, args=(cluster_address, )
    )
    process.start()
    process.join()

    p_dask.terminate()  # sends SIGTERM, which triggers _stop_cluster


if __name__ == "__main__":
    main()
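
Note that signal.pause() blocks the cluster process until a signal arrives, and that both it and the SIGTERM-based shutdown are Unix-specific, so this approach will not work as-is on Windows.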
Joerg