
For context, I am trying to compute a pairwise distance matrix using Dynamic Time Warping on a list of spectrograms. All the sound files have different lengths (time steps), but I know their sizes before starting. The script runs fine sequentially, but it would take far too long to finish, so I'm trying to parallelize it with joblib.

Let's say I represent them as a list of arrays of type np.float32 (all the code is in the minimal example below). As a stand-in, I'll define the dtw function to create a random matrix and return the value in its last cell (last row, last column). The real function is optimized with numba, so it runs fairly fast (a sketch of what such a numba-jitted function might look like follows the example code below).

import numpy as np
from joblib import Parallel, delayed


# Number of samples
n = 20000

# Generate a list of random float32 "spectrograms" with 40 frequency
# bins and between 50 and 500 time steps each
x = [np.random.uniform(size=(t, 40)).astype(np.float32)
     for t in np.random.randint(low=50, high=500, size=n)]

# Placeholder function
def fake_dtw(a, b):
    mat = np.random.uniform(size=(len(a), len(b)))
    return mat[-1, -1]

# Code to compute pairwise distance
batch_size = 1000
pre_dispatch = 2 * batch_size
with Parallel(n_jobs=-1, batch_size=batch_size, pre_dispatch=pre_dispatch) as p:
    results = p(
        delayed(
            lambda i, j, a, b: (i, j, fake_dtw(a, b))
        )(i, j, x[i], x[j])
        for i in range(1, len(x))
        for j in range(i)
    )

dtw_matrix = np.zeros(shape=(len(x), len(x)))
for i, j, res in results:
    dtw_matrix[i, j] = res
    dtw_matrix[j, i] = res
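For reference, here is a sketch of what the real, numba-jitted DTW might look like. This is my illustration under assumptions, not the actual code: it assumes Euclidean frame distances and the classic O(n·m) recurrence.

import numpy as np
from numba import njit

@njit(cache=True)
def dtw(a, b):
    # a: (n, 40) and b: (m, 40) spectrogram frames
    n_a, n_b = a.shape[0], b.shape[0]
    cost = np.full((n_a + 1, n_b + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n_a + 1):
        for j in range(1, n_b + 1):
            # Euclidean distance between frames a[i-1] and b[j-1]
            d = 0.0
            for k in range(a.shape[1]):
                diff = a[i - 1, k] - b[j - 1, k]
                d += diff * diff
            cost[i, j] = d ** 0.5 + min(cost[i - 1, j],
                                        cost[i, j - 1],
                                        cost[i - 1, j - 1])
    return cost[n_a, n_b]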

I have read the documentation as well as this question: What batch_size and pre_dispatch in joblib exactly mean. So I know how batch_size and pre_dispatch work, but I can't think of a way to compute proper values that get the best performance.

My question is the following: given

  • the size of all items in the list (which I can compute just before launching)
  • the number of operations (about 200 million in this case, since it's all unordered pairs of the 20000 samples)
  • the number of CPUs (I can launch up to 48 workers at once)
  • my computer's RAM (64 GB)

is there a way I can choose batch_size and pre_dispatch so that the operations are computed as fast as possible?

On a dataset about a quarter the size of my current one, I was able to get away with pre_dispatch='all' and batch_size=(number of operations)/os.cpu_count(), so all the data is distributed at once before running (see the sketch below); but it crashes if I try that with the current dataset (which I assume is due to memory usage). I tried a few more values, but I was wondering whether there's a more principled way of doing this instead of brute-forcing and seeing what works.
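For concreteness, that configuration looked roughly like this (a sketch; n_ops just stands for the total number of pairwise operations):

import os
import math

# Number of unordered pairs to compute
n_ops = n * (n - 1) // 2

# One batch per CPU, everything dispatched up front
batch_size = math.ceil(n_ops / os.cpu_count())
with Parallel(n_jobs=-1, batch_size=batch_size, pre_dispatch='all') as p:
    ...  # same generator of delayed calls as above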

Thank you in advance!


1 Answer


I never quite found an answer to the question itself, but I did find a workaround. Although I'm not sure it's maximally optimized, it runs far faster than before even on smaller data, and completes a run on the 20k dataset in about 4 hours.

So I figured I'd leave that workaround here for future reference; it basically follows the joblib documentation.

It turns out that I was copying x to every process spawned by joblib, which doesn't end well with ~20k spectrograms, never mind more.

According to the documentation, there are a couple workarounds:

  • Using threads with the threading backend instead of the default loky processes (which I didn't explore, because I couldn't manage to make it work with my implementation; a minimal sketch of that approach follows this list)
  • Using a numpy memmap to store x as well as the output matrix, defined so that all processes can access it. This is what I ended up using, and it worked well, computing the ~200 million DTW distances in a little over 4 hours. (I'll admit I'm not entirely sure how it compares to other implementations in Python, never mind C, but none of the implementations I found worked for my problem: they either didn't handle multivariate time series or didn't accept a list of arrays with different time lengths.)
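For completeness, the threading variant I didn't pursue might look something like this (a sketch only, reusing the definitions from the question; it helps only if the real DTW releases the GIL, e.g. via numba's nogil=True, since otherwise the threads just serialize):

# Threads share memory, so x is never copied to workers
with Parallel(n_jobs=-1, backend="threading") as p:
    results = p(
        delayed(lambda i, j: (i, j, fake_dtw(x[i], x[j])))(i, j)
        for i in range(1, len(x))
        for j in range(i)
    )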

In case anybody ever happens upon this question, I leave here the code from the original question, updated to use the memmap solution:

import numpy as np
import os
import math
from joblib import Parallel, delayed, dump, load


# Number of samples
n = 20000

# Generate
x = [np.random.uniform(size=(t, 40)).astype(np.float32)
     for t in np.random.randint(low=50, high=500, size=n)]

# Placeholder function: writes its result directly into the shared output
def fake_dtw(i, j, a, b, output):
    res = np.random.uniform(size=(len(a), len(b)))[-1, -1]
    output[i, j] = res
    output[j, i] = res

# Dump x to a memmap (the folder must exist first)
memmap_folder = "temp"
os.makedirs(memmap_folder, exist_ok=True)
x_filename_memmap = f'{memmap_folder}/x_memmap'
dump(x, x_filename_memmap)
x = load(x_filename_memmap, mmap_mode='r')

# Initialise the output matrix as writable memmap accessible by all processes
output_filename_memmap = f'{memmap_folder}/output_memmap'
output = np.memmap(output_filename_memmap, dtype=x[0].dtype, shape=(len(x), len(x)), mode='w+')

# Compute the pairwise distances; each task writes into the shared
# memmap, so there is no result to collect
batch_size = math.ceil(n * (n - 1) / 2 / os.cpu_count())
pre_dispatch = 'all'
with Parallel(n_jobs=-1, batch_size=batch_size, pre_dispatch=pre_dispatch) as p:
    p(
        delayed(fake_dtw)(i, j, x[i], x[j], output)
        for i in range(1, len(x))
        for j in range(i)
    )

# retrieve output as you wish
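For example, to get the finished matrix back afterwards, the memmap can be flushed and re-opened read-only (a sketch; the dtype and shape must match what was written above):

# Make sure all writes have hit the file, then re-open read-only
output.flush()
dtw_matrix = np.memmap(output_filename_memmap, dtype=x[0].dtype,
                       shape=(len(x), len(x)), mode='r')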