I've structured this in two sections, BACKGROUND and QUESTION. The Question is all the way at the bottom.
BACKGROUND:
Suppose I want to (using Dask distributed) do an embarrassingly parallel computation like summing 16 gigantic dataframes. I know that this is going to be blazing fast using CUDA but let's please stay with Dask for this example.
A basic way to accomplish this (using delayed) is:
from functools import reduce
import math
from dask import delayed, compute, visualize
import dask.distributed as dd
import numpy as np

@delayed
def gen_matrix():
    return np.random.rand(1000, 1000)

@delayed
def calc_sum(matrices):
    return reduce(lambda a, b: a + b, matrices)

if __name__ == '__main__':
    num_matrices = 16
    # Plop them into a big list
    matrices = [gen_matrix() for _ in range(num_matrices)]
    # Here's the Big Sum
    matrices = calc_sum(matrices)
    # Go!
    with dd.Client('localhost:8786') as client:
        f = client.submit(compute, matrices)
        result = client.gather(f)
And here's the dask graph:
This certainly will work, BUT as the size of the matrices (see gen_matrix above) gets too large, the Dask distributed workers start to have three problems:
- They time out sending data to the main worker performing the sum
- The main worker runs out of memory gathering all of the matrices
- The overall sum is not running in parallel (only matrix generation is)
Note that none of these issues is Dask's fault; it's working as advertised. I've just set up the computation poorly.
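To put rough numbers on the second and third problems, here is a back-of-the-envelope sketch. It assumes dense float64 arrays (8 bytes per element, which is what `np.random.rand` produces) and ignores temporaries and serialization overhead; `flat_sum_cost` is just an illustrative name.

```python
FLOAT64_BYTES = 8  # assumption: dense float64 matrices


def flat_sum_cost(num_matrices, rows, cols):
    """Estimate what the single summing worker has to hold and do."""
    bytes_per_matrix = rows * cols * FLOAT64_BYTES
    # Every input matrix is shipped to the one worker running calc_sum
    gathered_bytes = num_matrices * bytes_per_matrix
    # reduce() adds the matrices one after another on that worker
    sequential_adds = num_matrices - 1
    return gathered_bytes, sequential_adds


gathered, adds = flat_sum_cost(16, 1000, 1000)
print(f"{gathered / 1e6:.0f} MB gathered on one worker, {adds} sequential additions")
```

Both figures grow linearly with the number of matrices, and the gathered bytes also grow with the matrix size, which is why the flat version stops working once the matrices get big.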
One solution is to break this into a tree computation, which is shown here, along with the dask visualization of that graph:
from functools import reduce
import math
from dask import delayed, compute, visualize
import dask.distributed as dd
import numpy as np

@delayed
def gen_matrix():
    return np.random.rand(1000, 1000)

@delayed
def calc_sum(a, b):
    return a + b

if __name__ == '__main__':
    num_matrices = 16
    # Plop them into a big list
    matrices = [gen_matrix() for _ in range(num_matrices)]
    # This tells us the depth of the calculation portion
    # of the tree we are constructing in the next step
    depth = int(math.log(num_matrices, 2))
    # This is the code I don't want to have to manually write
    for _ in range(depth):
        matrices = [
            calc_sum(matrices[i], matrices[i+1])
            for i in range(0, len(matrices), 2)
        ]
    # Go!
    with dd.Client('localhost:8786') as client:
        f = client.submit(compute, matrices)
        result = client.gather(f)
And the graph:
QUESTION:
I would like to be able to get this tree generation done by either a library or perhaps Dask itself. How can I accomplish this?
And for those who are wondering, why not just use the code above? Because there are edge cases that I don't want to have to code for, and also because it's just more code to write :)
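To illustrate the kind of edge case I mean: the loop above assumes the number of matrices is a power of two, and `matrices[i+1]` fails on an odd-length level. A minimal sketch of a version that copes with that might look like the following. `tree_reduce` is a hypothetical helper name, not something from Dask or the standard library, and it accepts any binary function, so it can be tried with plain integers first.

```python
import operator


def tree_reduce(combine, items):
    """Pairwise-combine items level by level until one value is left."""
    items = list(items)
    if not items:
        raise ValueError("tree_reduce() needs at least one item")
    while len(items) > 1:
        # Combine neighbours (0,1), (2,3), ... stopping before a lone last item
        paired = [
            combine(items[i], items[i + 1])
            for i in range(0, len(items) - 1, 2)
        ]
        # Odd count: carry the unpaired last item up to the next level
        if len(items) % 2:
            paired.append(items[-1])
        items = paired
    return items[0]


print(tree_reduce(operator.add, range(1, 8)))  # 28
```

With the tree example above, this would presumably be called as `tree_reduce(calc_sum, matrices)`, giving a single delayed object instead of a one-element list. It is exactly this sort of helper that I'd rather get from a library than maintain myself.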
I have also seen this: Parallelize tree creation with dask
Is there something in functools or itertools that knows how to do this (and can be used with dask.delayed)?