The logs of a function submitted via the client are displayed immediately. Instead, I expect the logs to appear only on client.gather(futures). The expected behavior can be achieved using Delayed but not using Futures.

Here is the code to reproduce the issue:

from dask.distributed import Client
from logging import warning

client = Client(processes=False, n_workers=2)

def inc(x):
    warning(f"{x}")
    return x + 1

output = []
for x in [1, 2, 3, 4, 5]:
    a = client.submit(inc, x)
    output.append(a)

The code above already displays the logs on submission, as shown below.

Output:

2022-09-19 20:55:23 ⚡ [root] 1
2022-09-19 20:55:23 ⚡ [root] 2
2022-09-19 20:55:23 ⚡ [root] 3
2022-09-19 20:55:23 ⚡ [root] 4
2022-09-19 20:55:23 ⚡ [root] 5

Output of client.gather(output):

[2, 3, 4, 5, 6]

But the logs are expected to appear only when client.gather(output) executes, along with the returned results.

Intended behavior using Dask Delayed:

import dask
from logging import warning

@dask.delayed
def inc(x):
    warning(f"{x}")
    return x + 1

data = [1, 2, 3, 4, 5]
output = []
for x in data:
    a = inc(x)
    output.append(a)

total = dask.delayed(output)

total.compute()

Output:

2022-09-19 21:05:01 ⚡ [root] 3
2022-09-19 21:05:01 ⚡ [root] 1
2022-09-19 21:05:01 ⚡ [root] 4
2022-09-19 21:05:01 ⚡ [root] 5
2022-09-19 21:05:01 ⚡ [root] 2
[2, 3, 4, 5, 6]

Could we get the expected behavior using Dask futures?

Roxy
    Short answer is no... `client.submit` and `client.map` do not work the same way as `delayed` - unlike `delayed`, these functions trigger work immediately. There's nothing wrong with scheduling delayed tasks, so I guess my question for you is why can't you just use delayed if you want your tasks to wait for a call to `compute()`? – Michael Delgado Sep 19 '22 at 21:44
  • I see. The main intention behind asking for the above behavior is the additional features available with the Client, like task status. Could we get the status of a task using dask.delayed? I know it sounds a bit stupid, but it is useful for my better understanding – Roxy Sep 19 '22 at 21:55
  • the status of a task is only useful once the computation has started, unless I'm not thinking of something. so I think the answer is no to this as well. did you have a more specific use case in mind we could help you work through? – Michael Delgado Sep 19 '22 at 21:56
  • Yeah, I would like to schedule a particular function in a loop, I can't use a map, as the index of the loop would be an input to the function. Once the tasks are scheduled, I get to see the status of the task on the fly and retrieve results once completed. – Roxy Sep 19 '22 at 22:08
  • couldn't you do something like `client.map(myfunc, list(range(len(args))), args)`? – Michael Delgado Sep 19 '22 at 22:14
  • anyway - yeah I don't really see why you can't do this with client.map/client.submit. if you have a more complete example it sounds like a good question - probably a bit much for comments – Michael Delgado Sep 19 '22 at 22:16
  • Thanks for the response. I tried with client.map and the following also triggered the log/print instantaneously. With the feedback from you, I understood that the function is not triggered when we call `gather` but instead works on the client submit/map itself. Is there a way to disable the worker logs on the console and write it to the file instead and make use of the futures instead of delayed? – Roxy Sep 19 '22 at 22:49
    sorry - this is an entirely different question about writing log outputs to files that doesn't have anything to do with dask as far as I can tell (sure - can't you just write a file like normal? or use python's logging module and write to a file). if you have another question, feel free to ask in a new question :) – Michael Delgado Sep 19 '22 at 23:04
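As the last comment suggests, worker-side output can be redirected to a file with Python's standard logging module. A minimal sketch (the file name worker.log is illustrative; note this configures the root logger in the current process, which covers the workers here only because processes=False keeps them in the client process):

```python
import logging

# Send root-logger output to a file instead of the console.
# force=True replaces any handlers configured earlier.
logging.basicConfig(
    filename="worker.log",
    level=logging.WARNING,
    format="%(asctime)s [%(name)s] %(message)s",
    force=True,
)

def inc(x):
    logging.warning(f"{x}")  # written to worker.log, not the console
    return x + 1
```

With separate worker processes, logging would instead need to be configured on each worker.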

1 Answer


Your assumption is not correct: Dask futures are a wrapper around the Python base module concurrent.futures.

From the Futures documentation (emphasis added):

This interface is good for arbitrary task scheduling like dask.delayed, but is *immediate rather than lazy*, which provides some more flexibility in situations where the computations may evolve over time. These features depend on the second generation task scheduler found in dask.distributed (which, despite its name, runs very well on a single machine).

What this basically means is that task computations start exactly when they are submitted to the client (not the case for dask.delayed). In your case, you store the futures in a list, which forces Dask to keep the result of each future in memory once it is computed; you can then call gather on those futures to recover the results from distributed memory.
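The same eager behavior can be observed with the standard library's concurrent.futures, which the Dask Futures API mirrors: a submitted task starts running as soon as a worker is free, before any result is requested. A small illustration (names like `started` are mine):

```python
import time
from concurrent.futures import ThreadPoolExecutor

started = []

def inc(x):
    started.append(x)   # record that the task has begun running
    time.sleep(0.2)
    return x + 1

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = [pool.submit(inc, x) for x in [1, 2, 3]]
    time.sleep(0.1)          # no result has been requested yet...
    eager = len(started)     # ...but tasks are already running
    results = [f.result() for f in futures]
```

Here `eager` is nonzero even though `result()` has not been called yet, showing that submission, not gathering, triggers the work.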

As an example: in your case you have 2 workers. If you introduce a small delay in your function, you will see that two elements are printed at the same time; once those workers are free, the next two tasks are executed.

from dask.distributed import Client
from logging import warning
import time

client = Client(processes=False, n_workers=2)

def inc(x):
    warning(f"{x}")
    time.sleep(2)
    return x + 1

output = []
for x in [1, 2, 3, 4, 5]:
    a = client.submit(inc, x)
    output.append(a)

The output should look something like this:

2022-09-19 20:55:23 ⚡ [root] 1
2022-09-19 20:55:23 ⚡ [root] 2
2022-09-19 20:55:25 ⚡ [root] 3
2022-09-19 20:55:25 ⚡ [root] 4
2022-09-19 20:55:27 ⚡ [root] 5
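For the monitoring use case raised in the comments (watch task status on the fly and retrieve results as they finish), dask.distributed provides an as_completed helper; the standard library's concurrent.futures.as_completed works the same way, sketched here with a dict that maps each future back to the loop index that produced it:

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def inc(x):
    time.sleep(0.1 * x)
    return x + 1

with ThreadPoolExecutor(max_workers=2) as pool:
    # Map each future back to its input so results can be matched up.
    futures = {pool.submit(inc, x): x for x in [1, 2, 3]}
    done = []
    for fut in as_completed(futures):   # yields futures as they finish
        done.append((futures[fut], fut.result()))
```

Results arrive in completion order rather than submission order, so keeping the future-to-input mapping is what lets you reassemble them.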
Michael Delgado
Lucas M. Uriarte