
I have a Dask data flow that mutates its input in place and then returns it:

from dask.distributed import Client
import numpy as np
from typing import List

def update(model: np.ndarray, change: List[float]) -> np.ndarray:
    change = np.asarray(change)
    model -= change.mean()
    return model

def use(model: np.ndarray, val: float) -> float:
    assert np.allclose(model.mean(), val)
    return 1.0

client = Client()
model = np.zeros(10)
model_future = client.scatter(model)

for i in range(10):
    val_future = [client.submit(use, model_future, -i) for _ in range(4)]
    model_future = client.submit(update, model_future, val_future)

print(model_future.result())  # [-10., ...]

Clearly, this works and produces the expected result. However, this example is explicitly warned against in Dask's Best Practices, and mutating the input is warned against in another SO question.

In my use case, copying the input with deepcopy is an expensive operation – including model = deepcopy(model) would double the time of an update call. I'm inclined to avoid running deepcopy on each worker, especially because the example above works.

My question: does the data flow above resolve the issues Dask has with mutating inputs? The mutated input is returned, which I presume helps. When would the example above not produce the expected result?
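One case that seems worth checking is aliasing. When a task runs on the worker that already holds its input, the function receives that stored object rather than a copy, so an in-place update also changes the data behind the older future. Below is a minimal sketch of that effect with no Dask involved. The `store` dicts are hypothetical stand-ins for a worker's key-to-data mapping, `update_in_place` and `update_copy` are illustrative names, and a plain list stands in for the NumPy array so the snippet has no dependencies.

```python
from statistics import mean
from typing import Dict, List


def update_in_place(model: List[float], change: List[float]) -> List[float]:
    # Mimics `model -= change.mean()`: the same list object is modified.
    shift = mean(change)
    for k in range(len(model)):
        model[k] -= shift
    return model


def update_copy(model: List[float], change: List[float]) -> List[float]:
    # Non-mutating variant: builds a new list and leaves the input untouched.
    shift = mean(change)
    return [m - shift for m in model]


# Hypothetical stand-in for one worker's data: key -> stored object.
store: Dict[str, List[float]] = {"model-0": [0.0] * 10}
store["model-1"] = update_in_place(store["model-0"], [1.0] * 4)

# Both keys now point at the same, already-updated object.
aliased = store["model-1"] is store["model-0"]
old_value_after_mutation = store["model-0"][0]

# With the copying variant, the old key keeps its original contents.
store_copy: Dict[str, List[float]] = {"model-0": [0.0] * 10}
store_copy["model-1"] = update_copy(store_copy["model-0"], [1.0] * 4)
old_value_after_copy = store_copy["model-0"][0]
```

In the loop above, the old `model_future` is rebound on every iteration and never read again, which may be why the result comes out as expected. Anything that still reads the pre-update key would see the mutated values.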

Scott
  • Dask discourages mutating inputs because doing so can create a race condition. In the example you wrote here, imagine that you submit your `use` and `update` calls and they start running asynchronously. One of your `update` calls could run before all of your `use` calls have finished, and the next call to `use` would then operate on a `model` object whose values were changed out from underneath it by `update` running out of the expected order. – Brandon Bocklund Dec 11 '20 at 20:13
  • Got it. Race conditions and asynchrony explain why input mutation is so strongly discouraged. This data flow is serial (the outputs of `update`/`use` are inputs to `use`/`update` respectively), so that's not a concern here. – Scott Dec 11 '20 at 21:28
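
The race described in the first comment can be illustrated without a cluster. The sketch below is a toy under stated assumptions: `run` and its `schedule` list are hypothetical, tasks execute one at a time in the listed order on one shared object, and a plain list stands in for the NumPy array. It only shows what an in-place `update` does to a pending `use` when nothing forces `update` to wait. In the question's graph, `update` takes the `use` futures as inputs, which is the dependency the second comment relies on.

```python
from statistics import mean
from typing import List


def update(model: List[float], change: List[float]) -> List[float]:
    # In-place update, mirroring `model -= change.mean()`.
    shift = mean(change)
    for k in range(len(model)):
        model[k] -= shift
    return model


def use(model: List[float], val: float) -> bool:
    # Same check as in the question, returned as a bool instead of asserted.
    return abs(mean(model) - val) < 1e-9


def run(schedule: List[str]) -> List[bool]:
    # Hypothetical scheduler: runs tasks in the given order on one shared model.
    model = [0.0] * 10
    checks = []
    for task in schedule:
        if task == "use":
            checks.append(use(model, 0.0))  # every `use` expects the pre-update mean
        else:
            update(model, [1.0] * 4)
    return checks


# All four `use` calls finish before `update`: every check passes.
ordered = run(["use", "use", "use", "use", "update"])

# `update` slips in early: the remaining `use` calls see mutated data.
raced = run(["use", "use", "update", "use", "use"])
```

With a non-mutating `update`, the late `use` calls in the second schedule would still see the original values, because the object they hold is never modified.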
