
For context, I am trying to compute a pairwise distance matrix using Dynamic Time Warping on a list of spectrograms. All the sound files have different lengths (time steps), but I know their sizes before starting. The script runs fine sequentially, but it would take far too long to finish, so I'm trying to parallelize it with joblib.

Let's say I represent them as a list of arrays of type np.float32 (all the code is in the minimal example below). As a stand-in, I'll define the dtw function to create a random matrix and return the value in its last cell (last row, last column). The real function is optimized with numba, so it runs fairly fast (a sketch of what such a numba-jitted function might look like follows the example code below).

import numpy as np
from joblib import Parallel, delayed


# Number of samples
n = 20000

# Generate a list of random float32 "spectrograms" with 40 frequency
# bins and between 50 and 500 time steps each
x = [np.random.uniform(size=(t, 40)).astype(np.float32)
     for t in np.random.randint(low=50, high=500, size=n)]

# Placeholder function
def fake_dtw(a, b):
    mat = np.random.uniform(size=(len(a), len(b)))
    return mat[-1, -1]

# Code to compute pairwise distance
batch_size = 1000
pre_dispatch = 2 * batch_size
with Parallel(n_jobs=-1, batch_size=batch_size, pre_dispatch=pre_dispatch) as p:
    results = p(
        delayed(
            lambda i, j, a, b: (i, j, fake_dtw(a, b))
        )(i, j, x[i], x[j])
        for i in range(1, len(x))
        for j in range(i)
    )

dtw_matrix = np.zeros(shape=(len(x), len(x)))
for i, j, res in results:
    dtw_matrix[i, j] = res
    dtw_matrix[j, i] = res
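For reference, here is a sketch of what the real, numba-jitted DTW might look like. This is my illustration under assumptions, not the actual code: it assumes Euclidean frame distances and the classic O(n·m) recurrence.

import numpy as np
from numba import njit

@njit(cache=True)
def dtw(a, b):
    # a: (n, 40) and b: (m, 40) spectrogram frames
    n_a, n_b = a.shape[0], b.shape[0]
    cost = np.full((n_a + 1, n_b + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n_a + 1):
        for j in range(1, n_b + 1):
            # Euclidean distance between frames a[i-1] and b[j-1]
            d = 0.0
            for k in range(a.shape[1]):
                diff = a[i - 1, k] - b[j - 1, k]
                d += diff * diff
            cost[i, j] = d ** 0.5 + min(cost[i - 1, j],
                                        cost[i, j - 1],
                                        cost[i - 1, j - 1])
    return cost[n_a, n_b]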

I have read the documentation as well as this question: What batch_size and pre_dispatch in joblib exactly mean. So I know how batch_size and pre_dispatch work, but I can't think of a way to compute proper values that get the best performance.

My question is the following: given

  • the size of all items in the list (which I can compute just before launching)
  • the number of operations (about 200 million in this case, since it's all unordered pairs of the 20000 samples)
  • the number of CPUs (I can launch up to 48 workers at once)
  • my computer's RAM (64 GB)

is there a way I can choose batch_size and pre_dispatch so that the operations are computed as fast as possible?

On a dataset about a quarter the size of my current one, I was able to get away with pre_dispatch='all' and batch_size=(number of operations)/os.cpu_count(), so all the data is distributed at once before running (see the sketch below); but it crashes if I try that with the current dataset (which I assume is due to memory usage). I tried a few more values, but I was wondering whether there's a more principled way of doing this instead of brute-forcing and seeing what works.
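For concreteness, that configuration looked roughly like this (a sketch; n_ops just stands for the total number of pairwise operations):

import os
import math

# Number of unordered pairs to compute
n_ops = n * (n - 1) // 2

# One batch per CPU, everything dispatched up front
batch_size = math.ceil(n_ops / os.cpu_count())
with Parallel(n_jobs=-1, batch_size=batch_size, pre_dispatch='all') as p:
    ...  # same generator of delayed calls as above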

Thank you in advance!


1 Answer


I never quite found an answer to the question itself, but I did find a workaround. Although I'm not sure it's maximally optimized, it runs far faster than before even on smaller data, and completes a run on the 20k dataset in about 4 hours.

So I figured I'd leave that workaround here for future reference; it basically follows the joblib documentation.

It turns out that I was copying x to every process spawned by joblib, which doesn't end well with ~20k spectrograms, never mind more.

According to the documentation, there are a couple workarounds:

  • Using threads with the threading backend instead of the default loky processes (which I didn't explore, because I couldn't manage to make it work with my implementation; a minimal sketch of that approach follows this list)
  • Using a numpy memmap to store x as well as the output matrix, defined so that all processes can access it. This is what I ended up using, and it worked well, computing the ~200 million DTW distances in a little over 4 hours. (I'll admit I'm not entirely sure how it compares to other implementations in Python, never mind C, but none of the implementations I found worked for my problem: they either didn't handle multivariate time series or didn't accept a list of arrays with different time lengths.)
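For completeness, the threading variant I didn't pursue might look something like this (a sketch only, reusing the definitions from the question; it helps only if the real DTW releases the GIL, e.g. via numba's nogil=True, since otherwise the threads just serialize):

# Threads share memory, so x is never copied to workers
with Parallel(n_jobs=-1, backend="threading") as p:
    results = p(
        delayed(lambda i, j: (i, j, fake_dtw(x[i], x[j])))(i, j)
        for i in range(1, len(x))
        for j in range(i)
    )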

In case anybody ever happens upon this question, I leave here the code from the original question, updated to use the memmap solution:

import numpy as np
import os
import math
from joblib import Parallel, delayed, dump, load


# Number of samples
n = 20000

# Generate
x = [np.random.uniform(size=(t, 40)).astype(np.float32)
     for t in np.random.randint(low=50, high=500, size=n)]

# Placeholder function: writes its result directly into the shared output
def fake_dtw(i, j, a, b, output):
    res = np.random.uniform(size=(len(a), len(b)))[-1, -1]
    output[i, j] = res
    output[j, i] = res

# Dump x to a memmap (the folder must exist first)
memmap_folder = "temp"
os.makedirs(memmap_folder, exist_ok=True)
x_filename_memmap = f'{memmap_folder}/x_memmap'
dump(x, x_filename_memmap)
x = load(x_filename_memmap, mmap_mode='r')

# Initialise the output matrix as writable memmap accessible by all processes
output_filename_memmap = f'{memmap_folder}/output_memmap'
output = np.memmap(output_filename_memmap, dtype=x[0].dtype, shape=(len(x), len(x)), mode='w+')

# Compute the pairwise distances; each task writes into the shared
# memmap, so there is no result to collect
batch_size = math.ceil(n * (n - 1) / 2 / os.cpu_count())
pre_dispatch = 'all'
with Parallel(n_jobs=-1, batch_size=batch_size, pre_dispatch=pre_dispatch) as p:
    p(
        delayed(fake_dtw)(i, j, x[i], x[j], output)
        for i in range(1, len(x))
        for j in range(i)
    )

# retrieve output as you wish
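For example, to get the finished matrix back afterwards, the memmap can be flushed and re-opened read-only (a sketch; the dtype and shape must match what was written above):

# Make sure all writes have hit the file, then re-open read-only
output.flush()
dtw_matrix = np.memmap(output_filename_memmap, dtype=x[0].dtype,
                       shape=(len(x), len(x)), mode='r')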