The question "How does dask.delayed handle mutable inputs?" explains this for delayed, but what should I do if I need the input to be mutated?

I have been trying out Dask and I need to understand how dictionaries are mutated when functions are called through distributed.Client.get versus sequentially (the normal way).

Sequential

def foo(dictionary):
    dictionary['foo'] = 'foo was called'

def bar(dictionary):
    dictionary['bar'] = 'bar was called'

dictionary = {}

print(dictionary) # {}
foo(dictionary)
print(dictionary) # {'foo': 'foo was called'}
bar(dictionary)
print(dictionary) # {'foo': 'foo was called', 'bar': 'bar was called'}

This works the way I expect: the dictionary is mutated and has two keys after the calls to foo and bar.

Dask

from dask.distributed import Client

client = Client(processes=False)

def foo(dictionary):
    dictionary['foo'] = 'foo was called'

def bar(dictionary):
    dictionary['bar'] = 'bar was called'

dictionary = {}

dsk = {'foo': (foo, dictionary), 'bar':(bar, dictionary)}

client.get(dsk, ['foo', 'bar'])

print(dictionary) # {}

Why does this print an empty dict? Why is it not mutated? I noticed that the dictionary has a different id(dictionary) inside each function, so I understand it is a copy.

Is it safe to assume that every function gets its own copy of the objects passed to it, so that I can mutate them within the function and leave the global one untouched? Is this understanding correct?

Fiona
  • Not sure this is a function call: `dsk = {'foo': (foo, dictionary), 'bar':(bar, dictionary)}`. You are building a dict with tuples of (func, empty dict). – David Meu Mar 22 '21 at 13:44
  • @DavidMeu yes it is not, the functions are called in the next line – Fiona Mar 22 '21 at 13:46
  • Ok so you should check that client get call. – David Meu Mar 22 '21 at 13:47
  • @DavidMeu the functions do run, I have added print and it works – Fiona Mar 22 '21 at 13:47
  • what is returned by ```client.get(dsk, ['foo', 'bar'])```? – David Meu Mar 22 '21 at 13:52
  • @DavidMeu `[None, None]` – Fiona Mar 22 '21 at 13:53
  • I'm not sure regarding dask. But it should pass by reference basically. – David Meu Mar 22 '21 at 13:57
  • By reading some of the docs: https://docs.dask.org/en/latest/graphs.html#don-t-modify-data-in-place It seems it is creating a copy. – David Meu Mar 22 '21 at 14:25
  • If you want to see something other than `[None, None]`, perhaps `return` something from those functions. The expectation of changing the input just doesn't make sense in a distributed context. – tevemadar Mar 22 '21 at 14:28
  • @tevemadar I am aware how that `None` is returned, the question was not about that, I want to know how to mutate a dict or if that is possible, nowhere in the question have I mentioned that getting `None` is the problem, that was response to a different comment – Fiona Mar 22 '21 at 14:33

1 Answer

The short answer: when you pack up a graph and send it to the scheduler, which then sends it to the workers, the graph is serialised and then deserialised. The distributed scheduler was written on the assumption that the scheduler and workers live in another process or on another machine, which is why you see copies even with Client(processes=False). The round trip creates new objects, so mutating them has no effect on the original. I believe that with pickle5, larger buffer-like objects (e.g., arrays) may be zero-copy.
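To see why a serialisation round trip hides the mutation, here is a minimal sketch that uses only the standard-library pickle module. It is an illustration of the effect, not the actual code path inside distributed; the name `received` is made up for this example.

```python
import pickle

def foo(dictionary):
    dictionary['foo'] = 'foo was called'

dictionary = {}

# Stand-in for shipping the argument to a worker: serialise, then deserialise.
# (Assumption: this only mimics the effect; distributed has its own machinery.)
received = pickle.loads(pickle.dumps(dictionary))

foo(received)

print(received is dictionary)  # False: the function worked on a different object
print(received)                # {'foo': 'foo was called'}
print(dictionary)              # {}
```

The function did run and did mutate its argument, but the argument was a fresh object, which matches the differing id() values seen in the question.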

By contrast, with the default threaded scheduler the graph is simply handed to the scheduler in the same process and nothing gets copied. This is a far simpler mechanism and a far simpler implementation of the scheduler, but it still has its uses.

To actually mutate objects in place, you would need to use either variables (not meant for large objects, since they live on the scheduler), actors (a niche use case), or shared memory. In any case, doing so would break the normal dask "functional" assumption that the outcome of a task depends only on its inputs, and you would need to be careful around cases where a task might be run twice.
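If the goal is just to end up with both keys, one option that stays within the functional assumption is to have each task return a new dict and combine the results in a final task. The sketch below shows the idea in plain Python; the `merge` helper and the `'merged'` key are hypothetical names chosen for this example, and the graph in the comment is an untested outline of how it might be wired up.

```python
def foo(dictionary):
    # Return a new dict instead of mutating the argument.
    return {**dictionary, 'foo': 'foo was called'}

def bar(dictionary):
    return {**dictionary, 'bar': 'bar was called'}

def merge(*parts):
    # Hypothetical helper: combine the per-task results into one dict.
    merged = {}
    for part in parts:
        merged.update(part)
    return merged

dictionary = {}

# As a dask graph this might look like (assumed, not run here):
# dsk = {'foo': (foo, dictionary),
#        'bar': (bar, dictionary),
#        'merged': (merge, 'foo', 'bar')}
# result = client.get(dsk, 'merged')

# Plain-Python equivalent of what that graph would compute:
result = merge(foo(dictionary), bar(dictionary))

print(result)      # {'foo': 'foo was called', 'bar': 'bar was called'}
print(dictionary)  # {}
```

Because no task depends on a side effect, running a task twice gives the same result, which sidesteps the double-execution concern mentioned above.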

mdurant