I have a data flow that mutates then returns the input with Dask:
from dask.distributed import Client
import numpy as np
from typing import List
def update(model: np.ndarray, change: List[float]) -> np.ndarray:
change = np.asarray(change)
model -= change.mean()
return model
def use(model: np.ndarray, val: float) -> float:
assert np.allclose(model.mean(), val)
return 1.0
client = Client()
model = np.zeros(10)
model_future = client.scatter(model)
for i in range(10):
val_future = [client.submit(use, model_future, -i) for _ in range(4)]
model_future = client.submit(update, model_future, val_future)
print(model_future.result()) # [-10., ...]
Clearly, this works and produces the expected result. However, this example is explicitly warned against in Dask's Best Practices, and mutating the input is warned against in another SO question.
In my use case, copying the input with deepcopy
is an expensive operation – including model = deepcopy(model)
would double the time of an update
call. I'm inclined to avoid running deepcopy
on each worker, especially because the example above works.
My question: does the data flow above resolve the issues Dask has with mutating inputs? The mutated input is returned, which I presume helps. When would the example above not produce the expected result?