I have a numpy array of coordinates of shape n_slice x 2048 x 3, where n_slice is in the tens of thousands. I want to apply the following operation to each 2048 x 3 slice separately:
import numpy as np
from scipy.spatial.distance import pdist
# load coor from a binary xyz file, dcd format
n_slice, n_coor, _ = coor.shape
r = np.arange(n_coor)
dist = np.zeros([n_slice, n_coor, n_coor])
# this loop is what I want to parallelize, each slice is completely independent
for i in range(n_slice):
    dist[i, r[:, None] < r] = pdist(coor[i])
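For reference, the mask r[:, None] < r selects the strict upper triangle in row-major order, which matches the order of pdist's condensed output. If a full symmetric matrix were acceptable, the loop body could roughly be written with squareform instead (a sketch; note it fills both triangles rather than only the upper one):
from scipy.spatial.distance import squareform

# fills the whole symmetric matrix (zero diagonal), not just the upper triangle
dist[i] = squareform(pdist(coor[i]))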
I tried using Dask by making coor a dask.array,
import dask.array as da
dcoor = da.from_array(coor, chunks=(1, 2048, 3))
but simply replacing coor with dcoor does not expose the parallelism. I could see setting up parallel threads to run on each slice, but how do I leverage Dask to handle the parallelism?
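One way this might be expressed with Dask (a sketch, not tested here; the helper name _block_dist and the use of map_blocks with an explicit output chunk shape are my assumptions, not from the original attempt) is to map a per-block function over the chunked array, with one slice per chunk:
import numpy as np
import dask.array as da
from scipy.spatial.distance import pdist

def _block_dist(block, r=None):
    # block has shape (1, n_coor, 3) because each chunk holds a single slice
    n = block.shape[1]
    out = np.zeros((1, n, n))
    out[0, r[:, None] < r] = pdist(block[0])
    return out

dcoor = da.from_array(coor, chunks=(1, 2048, 3))
ddist = dcoor.map_blocks(_block_dist, r=r, chunks=(1, 2048, 2048), dtype=float)
dist = ddist.compute()  # scheduler choice (threads vs. processes) is left open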
Here is a parallel implementation using concurrent.futures:
import concurrent.futures
import multiprocessing
n_cpu = multiprocessing.cpu_count()
def get_dist(coor, dist, r):
    # fill the upper triangle of one slice's distance matrix in place
    dist[r[:, None] < r] = pdist(coor)
# load coor from a binary xyz file, dcd format
n_slice, n_coor, _ = coor.shape
r = np.arange(n_coor)
dist = np.zeros([n_slice, n_coor, n_coor])
with concurrent.futures.ThreadPoolExecutor(max_workers=n_cpu) as executor:
    for i in range(n_slice):
        executor.submit(get_dist, coor[i], dist[i], r)
It is possible this problem is not well suited to Dask since there are no inter-chunk computations.
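If Dask is still preferred, one hedged alternative would be to skip dask.array entirely and wrap each independent slice with dask.delayed, since the work is embarrassingly parallel (slice_dist is an illustrative name; this mirrors the loop above):
import dask
import numpy as np
from scipy.spatial.distance import pdist

@dask.delayed
def slice_dist(xyz, n_coor):
    # compute one slice's upper-triangular distance matrix
    out = np.zeros((n_coor, n_coor))
    iu = np.triu_indices(n_coor, k=1)  # same row-major order as pdist's output
    out[iu] = pdist(xyz)
    return out

tasks = [slice_dist(coor[i], n_coor) for i in range(n_slice)]
dist = np.stack(dask.compute(*tasks))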