
When I submit a small TensorFlow training job as a single task, it launches additional threads. When I press Ctrl+C, raising KeyboardInterrupt, the task is closed, but the underlying threads are not cleaned up and training continues.

Initially, I thought this was a TensorFlow problem (not cleaning up its threads), but after testing I understand the problem comes from the Dask side, which probably doesn't propagate the SIGTERM signal down to the task function. My question: how can I configure Dask to propagate the SIGTERM signal to the running task?

Example of desired flow:

Local process -> press Ctrl+C -> Dask scheduler -> Dask worker -> SIGTERM signal -> running single task with TensorFlow training.

Thank you.

P.S. If you need additional information, just ask.

Update:

Code example:

from dask.distributed import Client

c = Client('<remote-scheduler>')

def task():
    # TensorFlow training
    model = ...
    model.fit(x_train, y_train)

training = c.submit(task)
training.result()

Now, during training, when I press Ctrl+C the task is canceled, but the TensorFlow threads/processes remain.
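
Note that cancelling a future only keeps the task from being scheduled again; Dask does not interrupt code that is already running on a worker. A common workaround is cooperative cancellation: the training loop polls a flag that the client sets. Below is a minimal sketch of that idea using `distributed.Variable` and a Keras callback; the variable name `stop-training` and the `StopOnFlag` callback are illustrative, not an existing Dask or TensorFlow API.

from dask.distributed import Client, Variable, get_client
from tensorflow import keras

c = Client('<remote-scheduler>')
stop_flag = Variable('stop-training', client=c)
stop_flag.set(False)

class StopOnFlag(keras.callbacks.Callback):
    def __init__(self):
        super().__init__()
        # get_client() returns the worker's client when called inside a task
        self.flag = Variable('stop-training', client=get_client())

    def on_batch_end(self, batch, logs=None):
        # one scheduler round-trip per batch; poll less often if too slow
        if self.flag.get():
            self.model.stop_training = True  # cooperative stop in Keras

def task():
    model = ...  # build the model as before
    model.fit(x_train, y_train, callbacks=[StopOnFlag()])

training = c.submit(task)
try:
    training.result()
except KeyboardInterrupt:
    stop_flag.set(True)   # ask the running task to stop itself
    training.cancel()     # prevents rescheduling; does not kill running code

With something like this in place, pressing Ctrl+C sets the flag and training stops at the next batch boundary instead of leaking threads.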

Update 2: output of the `ps -f -u [username]` command.

Dask cluster (1 scheduler, 1 worker, same server), no running tasks:

UID        PID  PPID  C STIME TTY          TIME CMD
vladysl+ 16547     1  0 12:40 ?        00:00:00 /lib/systemd/systemd --user
vladysl+ 16550 16547  0 12:40 ?        00:00:00 (sd-pam)
vladysl+ 16805 16311  0 12:40 ?        00:00:00 sshd: vladyslav@pts/45
vladysl+ 16811 16805  0 12:40 pts/45   00:00:00 -bash
vladysl+ 18946 16811  4 12:41 pts/45   00:00:24 /home/vladyslav/miniconda3/envs/py3.6/bin/python /home/vladyslav/miniconda3/envs/py3.6/bin/dask-scheduler --port 42001
vladysl+ 22284 22175  0 12:46 ?        00:00:00 sshd: vladyslav@pts/38
vladysl+ 22285 22284  0 12:46 pts/38   00:00:00 -bash
vladysl+ 23138 16811  1 12:48 pts/45   00:00:03 /home/vladyslav/miniconda3/envs/py3.6/bin/python /home/vladyslav/miniconda3/envs/py3.6/bin/dask-worker localhost:42001 --worker-port 420011 --memory-limit $
vladysl+ 23143 23138  0 12:48 pts/45   00:00:00 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.semaphore_tracker import main;main(11)
vladysl+ 23145 23138  0 12:48 pts/45   00:00:00 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.forkserver import main; main(15, 16, ['distributed'], **{'sys_path': ['/home/vlady$
vladysl+ 23151 23145 99 12:48 pts/45   00:03:48 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.forkserver import main; main(15, 16, ['distributed'], **{'sys_path': ['/home/vlady$
vladysl+ 23536 23151  0 12:49 pts/45   00:00:00 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.semaphore_tracker import main;main(25)
vladysl+ 26150 22285  0 12:51 pts/38   00:00:00 ps -f -u vladyslav

While the task is running:

UID        PID  PPID  C STIME TTY          TIME CMD
vladysl+ 16547     1  0 12:40 ?        00:00:00 /lib/systemd/systemd --user
vladysl+ 16811 16805  0 12:40 pts/45   00:00:00 -bash
vladysl+ 18946 16811  4 12:41 pts/45   00:00:30 /home/vladyslav/miniconda3/envs/py3.6/bin/python /home/vladyslav/miniconda3/envs/py3.6/bin/dask-scheduler --port 42001
vladysl+ 22285 22284  0 12:46 pts/38   00:00:00 -bash
vladysl+ 23138 16811  1 12:48 pts/45   00:00:06 /home/vladyslav/miniconda3/envs/py3.6/bin/python /home/vladyslav/miniconda3/envs/py3.6/bin/dask-worker localhost:42001 --worker-port 420011 --memory-limit $
vladysl+ 23143 23138  0 12:48 pts/45   00:00:00 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.semaphore_tracker import main;main(11)
vladysl+ 23145 23138  0 12:48 pts/45   00:00:00 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.forkserver import main; main(15, 16, ['distributed'], **{'sys_path': ['/home/vlady$
vladysl+ 23151 23145 99 12:48 pts/45   00:07:55 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.forkserver import main; main(15, 16, ['distributed'], **{'sys_path': ['/home/vlady$
vladysl+ 23536 23151  0 12:49 pts/45   00:00:00 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.semaphore_tracker import main;main(25)
vladysl+ 27079 22285  0 12:54 pts/38   00:00:00 ps -f -u vladyslav

After pressing Ctrl+C, the task is canceled but TensorFlow continues working:

UID        PID  PPID  C STIME TTY          TIME CMD
vladysl+ 16811 16805  0 12:40 pts/45   00:00:00 -bash
vladysl+ 18946 16811  4 12:41 pts/45   00:00:31 /home/vladyslav/miniconda3/envs/py3.6/bin/python /home/vladyslav/miniconda3/envs/py3.6/bin/dask-scheduler --port 42001
vladysl+ 22285 22284  0 12:46 pts/38   00:00:00 -bash
vladysl+ 23138 16811  1 12:48 pts/45   00:00:06 /home/vladyslav/miniconda3/envs/py3.6/bin/python /home/vladyslav/miniconda3/envs/py3.6/bin/dask-worker localhost:42001 --worker-port 420011 --memory-limit $
vladysl+ 23143 23138  0 12:48 pts/45   00:00:00 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.semaphore_tracker import main;main(11)
vladysl+ 23145 23138  0 12:48 pts/45   00:00:00 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.forkserver import main; main(15, 16, ['distributed'], **{'sys_path': ['/home/vlady$
vladysl+ 23151 23145 99 12:48 pts/45   00:09:32 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.forkserver import main; main(15, 16, ['distributed'], **{'sys_path': ['/home/vlady$
vladysl+ 23536 23151  0 12:49 pts/45   00:00:00 /home/vladyslav/miniconda3/envs/py3.6/bin/python -c from multiprocessing.semaphore_tracker import main;main(25)
vladysl+ 27117 22285  0 12:54 pts/38   00:00:00 ps -f -u vladyslav

As you can see, nothing changes: the same forkserver process (PID 23151) keeps running at full CPU.

Vladyslav Moisieienkov
  • This should probably be filed as an issue at https://github.com/dask/distributed/issues. Are you running a local cluster? – Sergei Lebedev Jan 07 '19 at 17:01
  • Distributed (remote) with scheduler and worker. – Vladyslav Moisieienkov Jan 07 '19 at 17:16
  • Looking at the current implementation of Client, it should shut itself down in `__del__`. You could ensure this is the case by putting it inside a `with` block: https://github.com/dask/distributed/blob/master/distributed/client.py#L1126. Are you launching the workers manually, or via a cluster manager (e.g. YARN)? – Sergei Lebedev Jan 07 '19 at 17:30
  • Manually, through the command line. What do you mean by a `with` block? – Vladyslav Moisieienkov Jan 07 '19 at 20:16
  • Sorry, could you replace "task" with the correct term: perhaps "process"? A process that has ended has no threads, so I am a little confused by your description. Perhaps you mean the process started more processes? The output of `ps` or similar may be helpful. Or were you wanting to send a shutdown signal to remote processes...? – mdurant Jan 08 '19 at 01:35
  • A task that runs TensorFlow code starts more threads (or maybe processes). TensorFlow is able to close all started threads (or processes), but Dask doesn't propagate KeyboardInterrupt to the running task. I will update the question with a code example. – Vladyslav Moisieienkov Jan 08 '19 at 07:03
  • @mdurant I will also update the question with `ps` output later today. – Vladyslav Moisieienkov Jan 08 '19 at 07:16
  • @mdurant I think it's related to how `ThreadPoolExecutor` works. – Vladyslav Moisieienkov Jan 08 '19 at 13:47
  • Then you are on your own! There are articles about passing signals on to processes via subprocess, but I imagine in this case you'll need to intercept the signal yourself and call the appropriate method of your executor(s) within the workers, using `c.run()` (see the sketch after these comments). – mdurant Jan 08 '19 at 14:15
  • @mdurant What do you mean by "on your own"? What is `c.run()`, is it a client method? Thanks. – Vladyslav Moisieienkov Jan 10 '19 at 11:48
  • Yes, a documented method on the client. By "on your own" I mean you will have to figure out the right function to use in combination with `c.run()`. – mdurant Jan 10 '19 at 14:20
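
As a sketch of the approach mdurant suggests: trap SIGINT on the client and use `Client.run()`, which executes a function directly on every worker outside the task queue, to flip a stop flag there. The `training_flags` module and its `STOP` attribute are hypothetical; the module would have to be importable on the workers, and the training loop would have to poll the flag (for example from a Keras callback, as in the sketch above) for this to have any effect.

import signal
from dask.distributed import Client

c = Client('<remote-scheduler>')

def request_stop():
    # Runs inside the worker process, so it shares module state with the
    # running task. `training_flags` is a hypothetical module on the workers.
    import training_flags
    training_flags.STOP = True

def on_sigint(signum, frame):
    # Client.run() bypasses the task queue, so this executes even while
    # the training task is occupying the worker's threads.
    c.run(request_stop)
    raise KeyboardInterrupt  # preserve normal Ctrl+C behaviour locally

signal.signal(signal.SIGINT, on_sigint)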

1 Answer


Dask does not support propagating signals from the client through to workers running tasks.

MRocklin