For context, I am trying to compute a pairwise distance matrix using Dynamic Time Warping (DTW) on a list of spectrograms. The sound files all have different lengths (time steps), but I know their sizes before starting. The script runs fine sequentially, but it would take far too long to compute that way, so I'm trying to parallelize it with joblib.
Let's say I represent them as a list of np.float32 arrays (all the code is in the minimal example below). As a stand-in, I define the dtw function to create a random matrix and return the value in its last cell (bottom-right corner). The real function is optimized with numba, so it runs fairly fast.
import numpy as np
from joblib import Parallel, delayed
# Number of samples
n = 20000
# Generate random spectrograms with between 50 and 500 time steps each
x = [
    np.random.uniform(size=(length, 40)).astype(np.float32)
    for length in np.random.randint(low=50, high=500, size=n)
]
# Placeholder function
def fake_dtw(a, b):
    mat = np.random.uniform(size=(len(a), len(b)))
    return mat[-1, -1]
# Code to compute pairwise distance
batch_size = 1000
pre_dispatch = 2 * batch_size
with Parallel(n_jobs=-1, batch_size=batch_size, pre_dispatch=pre_dispatch) as p:
    results = p(
        # Carry (i, j) along with each distance so each result can be
        # placed in the matrix afterwards
        delayed(lambda i, j, a, b: (i, j, fake_dtw(a, b)))(i, j, x[i], x[j])
        for i in range(1, len(x))
        for j in range(i)
    )
# Fill in the symmetric distance matrix
dtw_matrix = np.zeros(shape=(len(x), len(x)))
for i, j, res in results:
    dtw_matrix[i, j] = res
    dtw_matrix[j, i] = res
I have read the documentation as well as this question, What batch_size and pre_dispatch in joblib exactly mean, so I know how batch_size and pre_dispatch work. What I can't figure out is how to compute values for them that give the best performance.
My question is the following: given

- the size of all items in the list (which I can compute just before launching),
- the number of operations (about 200 million in this case, since it's every unordered pair of the 20000 samples, i.e. n(n-1)/2),
- the number of CPUs (I can launch up to 48 workers at once), and
- my computer's RAM (64 GB),

is there a way to choose batch_size and pre_dispatch so the operations are computed as fast as possible?
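For concreteness, here is a minimal sketch of how I can compute those quantities just before launching (the variable names are my own; x is the list from the example above):

import os

n_samples = len(x)                          # 20000
n_ops = n_samples * (n_samples - 1) // 2    # ~200 million pairwise operations
total_bytes = sum(a.nbytes for a in x)      # memory footprint of the inputs
n_workers = min(48, os.cpu_count() or 1)    # up to 48 workers available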
On a dataset about a quarter of the size of my current one, I was able to get away with pre_dispatch='all' and batch_size=(number of operations) // os.cpu_count(), so that all the work was distributed at once before running (sketch below), but it crashes when I try that with the current dataset (which I assume is due to memory usage). I tried a few other values, but I was wondering whether there is a more principled way of doing this instead of brute-forcing and seeing what works.
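For reference, this is roughly the configuration that worked on the smaller dataset (a sketch assuming the same setup as the example above; the generator of delayed calls is unchanged):

import os
from joblib import Parallel, delayed

n_ops = len(x) * (len(x) - 1) // 2
# One large batch per worker, with everything dispatched before running
batch_size = max(1, n_ops // os.cpu_count())
with Parallel(n_jobs=-1, batch_size=batch_size, pre_dispatch='all') as p:
    results = p(
        delayed(lambda i, j, a, b: (i, j, fake_dtw(a, b)))(i, j, x[i], x[j])
        for i in range(1, len(x))
        for j in range(i)
    )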
Thank you in advance!