
I am trying to use multiprocessing to speed up a function where I tile 2000 arrays of shape (76, 76) into 3D arrays and apply a scaling factor.

It works fine when the number of tiles is less than about 200, but I get a Killed: 9 when it's greater than that, and I need to be able to handle on the order of 1000 tiles.

Here's a simplified version of the code:

from functools import partial
from multiprocessing.pool import ThreadPool
from multiprocessing import cpu_count
import numpy as np

def func_A(data, scale, N):
    """Tile the data N times and scale it"""
    arr = np.tile(data, (N, 1, 1))
    arr *= scale
    return arr

def func_B(N=4):
    """Create scaled arrays"""
    # Make data
    data = np.random.normal(size=(2000, 76, 76))

    # Make scales
    scales = np.arange(2000)

    # Multiprocess into tiled arrays
    pool = ThreadPool(cpu_count())
    func = partial(func_A, N=N)
    inpt = list(zip(data, scales))
    results = np.asarray(pool.starmap(func, inpt), dtype=np.float64)
    pool.close()
    pool.join()

    return results.swapaxes(0, 1)

So it's fine for func_B(4) but dies for func_B(500).

I understand I am taxing Python's memory with such large arrays, but what is the best way to get func_B to work with large N... preferably quickly? Am I using multiprocessing wrong? Should I be using something else altogether, e.g. Dask, Numba, Cython, etc.?

Any help would be greatly appreciated. Thanks!

Joe Flip
  • You are using 64-bit python, right? – user1558604 May 12 '20 at 03:59
  • @user1558604 Yes, 64-bit. – Joe Flip May 12 '20 at 04:03
  • When I execute `func_B(500)`, I end up with `MemoryError: Unable to allocate 43.0 GiB for an array with shape (2000, 500, 76, 76) and data type float64` - perhaps try to break down the execution of this code into smaller chunks that you offload/save and then piece together downstream? Others might have good suggestions in this regard – PeptideWitch May 12 '20 at 04:05
  • I strongly suggest Numba, it does all the heavy lifting. You don't have to waste time on multiprocessing, a weaker solution. – Massimo May 12 '20 at 07:16
  • Could you provide official solutions I can approve? I’m not too sure how I would implement those. Thanks! – Joe Flip May 12 '20 at 14:10
  • Can you maybe tell what happens with the data afterwards? Because the size of the arrays is too big to hold them all in RAM at once. Is it OK to write to a file? Then you can work it in chunks. Is there something happening afterwards that reduces the arrays again? – HrkBrkkl May 14 '20 at 14:06
  • Yes, there is a step afterward. Once I make the 2000 data cubes, I add them all up along axis 0 to make a single data cube of shape (N, 76, 76). It's fine to write them to file and chunk it. – Joe Flip May 14 '20 at 17:55
  • Is it possible to have the results in float16/float32 format? – Marco Cerliani May 15 '20 at 18:17
  • Apart from splitting the data into chunks you should really use a `Process` pool rather than a `Thread` pool as your workload seems to be CPU bound. – Ionut Ticus May 15 '20 at 18:57
  • numpy usually uses OpenBLAS as the backend. You can check this with `np.show_config()`. OpenBLAS is already multithreaded via OpenMP, so it is possible that you gain nothing from multiprocessing even if you somehow bypass the out-of-memory issue – tstanisl May 15 '20 at 19:02

3 Answers


I think the most intuitive way to ease the memory problem is to work with float16 arrays; going from float64 to float16 cuts the footprint by 4x, so the (2000, 500, 76, 76) array from the comments drops from ~43 GiB to ~10.8 GiB. I've also tried to rewrite the whole process in a simpler way (func_C):

### your method ###

def func_A(data, scale, N):
    """Tile the data N times and scale it"""
    arr = np.tile(data, (N, 1, 1))
    arr *= scale
    return arr

def func_B(N=4):
    """Create scaled arrays"""
    # Make data
    data = np.random.normal(size=(2000, 76, 76)).astype(np.float16) ###### set float16

    # Make scales
    scales = np.arange(2000).astype(np.float16) ###### set float16

    # Multiprocess into tiled arrays
    pool = ThreadPool(cpu_count())
    func = partial(func_A, N=N)
    inpt = list(zip(data, scales))
    results = np.asarray(pool.starmap(func, inpt), dtype=np.float16) ###### set float16
    pool.close()
    pool.join()

    return results.swapaxes(0, 1)

### alternative method ###

def func_C(N=4):

    scales = np.arange(2000).astype(np.float16)
    data = np.random.normal(size=(2000, 76, 76)).astype(np.float16)
    results = np.stack(N*[data*scales[:,None,None]])

    return results

CHECK RESULTS

np.random.seed(33)
a = func_B(10)
np.random.seed(33)
b = func_C(10)
(a == b).all() # ===> TRUE

CHECK PERFORMANCE

(performance comparison plot omitted)
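
For reference, a minimal way to reproduce a timing comparison yourself (my own sketch; the exact benchmark behind the original plot isn't shown):

import timeit

# Compare the thread-pool version against the vectorised rewrite at a small N
print("func_B:", timeit.timeit(lambda: func_B(10), number=3))
print("func_C:", timeit.timeit(lambda: func_C(10), number=3))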

Marco Cerliani
  • Bonus points for no new dependencies, but the process still gets killed at high N>1000. Thanks for the answer! – Joe Flip May 18 '20 at 03:23

I am not entirely sure what the purpose of your calculation is, but the following appears to do the job in dask:

import dask.array as da
import numpy as np

# Make data
data = da.random.normal(size=(2000, 76, 76), chunks=(2000, 76, 76))

# Make scales
scales = np.arange(2000)
N = 500
out = da.repeat(data, N, axis=0).reshape((N, 2000, 76, 76)) * scales.reshape((1, 2000, 1, 1))
out = out.sum(axis=0).compute()

This keeps working memory under ~5 GB and uses most of your cores.
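
As a side note (my own addition, not part of the original answer), you can see why memory stays low by inspecting the lazy array before the final .compute(): the big 4-D intermediate is never held in full, only the reduced result is.

# Build the lazy expression without computing it
lazy = da.repeat(data, N, axis=0).reshape((N, 2000, 76, 76)) * scales.reshape((1, 2000, 1, 1))

print(lazy.chunks)                                              # how Dask splits the work
print(lazy.nbytes / 1e9, "GB if fully materialised")            # ~46 GB, never allocated at once
print(lazy.sum(axis=0).nbytes / 1e9, "GB after the reduction")  # ~0.09 GB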

mdurant

So here are my observations after a gruelling bout with the task:

  • The final array you're supposed to obtain as output is 4-D, has the shape (2000, 2000, 76, 76), and is made of float64 values. A crude calculation (reproduced after this list) shows that the size of this array is 2000*2000*76*76*8 bytes ≈ 170 GB... so you definitely cannot hold all of it in memory at once.
  • Usage of multiprocessing is complicated (it always has been, for someone who has not studied multiprocessing rigorously) and it yields not-so-good computation times. For instance, on Google Colab (Tesla T4 GPU backend, 12 GB RAM), N = 50 takes ~4.5 s (minimum) to run. A better implementation with the module may be possible, but I'm not the one for it.
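
A quick check of that size estimate (my own arithmetic, using the same numbers as the first bullet above):

# Bytes needed for a (2000, 2000, 76, 76) float64 array
size_bytes = 2000 * 2000 * 76 * 76 * 8
print(size_bytes / 1024**3)   # ~172 GiB - far more than typical RAM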

My course of action:

  • To tackle the second issue, I use cupy, which is supposed to be a drop-in replacement for numpy in Python. By drop-in replacement, I mean you can replace numpy with cupy everywhere in your code (there are exceptions - irrelevant to this problem). cupy, however, uses CUDA on Nvidia GPUs, so you need to install CUDA before installing cupy. (Check this guide.) Or, if possible, you may prefer online computing resources, like I do with Google Colab.
  • I also divide the work into parts. I use a function fnh(a, scale, N) to compute the scaled-tiled arrays for arbitrary N.
  • I slice the intended output array into multiple parts and iteratively run fnh(...) on these slices. The slicing can be tuned for better optimisation, but I've just used something based on a crude guess.

Here is the code:

import cupy as cp


def fnh(a, scale, N):
    arr = cp.einsum('i,ijk->ijk', scale, a)
    result = cp.tile(arr, (N, 1, 1, 1))

    del arr
    return result


def slicer(arr, scales, N = 400):
    mempool = cp.get_default_memory_pool()
    pinned_mempool = cp.get_default_pinned_memory_pool()
    # result = np.empty((N, 2000, 76, 76))    # too large to be allocated

    section = 500                             # Choices subject
    parts = 80                                # to optimization
    step = N // parts

    for i in range(parts):                    # Slice N into equal parts
        begin = i*step
        end = begin + step

        stacked = cp.empty((step, 2000, 76, 76))

        for j in range(2000 // section):      # Section the 2000 arrays into equal parts
            sec_begin = j*section             # Separate indices here so the outer
            sec_end = sec_begin + section     # begin/end are not clobbered

            s = scales[sec_begin:sec_end]
            a = arr[sec_begin:sec_end]
            res = fnh(a, s, step)
            stacked[:, sec_begin:sec_end] = res       # Accumulate values

            del a, res

        # result[begin:end] = stacked         # This is where we were supposed to
                                              # accumulate values in result
        del stacked
        mempool.free_all_blocks()
        pinned_mempool.free_all_blocks()

First of all, I use cupy.einsum to compute the scaling vectorially on the arrays.
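
For anyone less familiar with einsum, that call is just a per-slice scaling and is equivalent to plain broadcasting (a small illustrative check of mine, written with NumPy for simplicity):

import numpy as np

a = np.random.normal(size=(5, 76, 76))
scale = np.arange(5, dtype=float)

# einsum('i,ijk->ijk', ...) multiplies slice a[i] by scale[i]
assert np.allclose(np.einsum('i,ijk->ijk', scale, a),
                   scale[:, None, None] * a)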

Second, I delete arrays wherever possible to recover space. Specifically, it is imperative to deallocate the space allotted by cupy in the GPU memory pool using mempool.free_all_blocks() and pinned_mempool.free_all_blocks(), to recover available GPU memory. Read about it here. However, since cupy caches the allocated memory, it might(?) be helpful to use this caching in a limited manner for some speedup. (This is a hunch and I'm not particularly informed about it.) So I reuse the same memory for the sectioned tiles and clear it after completing an N-slice.
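
If you want to confirm that the pools are actually released between slices (my own addition, not from the original answer), CuPy's memory pools expose byte counters:

import cupy as cp

mempool = cp.get_default_memory_pool()
print(mempool.used_bytes(), mempool.total_bytes())   # bytes in use / held by the pool

mempool.free_all_blocks()
print(mempool.used_bytes(), mempool.total_bytes())   # total_bytes should drop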

Third, where # result[begin:end] = stacked is, you're supposed to off-load the array somehow, because you cannot afford to hold the entire array in memory, as previously mentioned. Offloading each finished slice to whatever storage suits your application is probably a good way to avoid the memory problem.
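
One possible way to do that off-loading (a sketch of mine, not part of the original answer; the file name and the N = 400 shape are assumptions) is to copy each finished slice back to the host and write it into a disk-backed memory map:

import numpy as np
import cupy as cp

# Disk-backed stand-in for `result`, written one N-slice at a time
result = np.lib.format.open_memmap('result.npy', mode='w+',
                                   dtype=np.float64,
                                   shape=(400, 2000, 76, 76))

# Inside the outer loop, in place of the commented-out line:
# result[begin:end] = cp.asnumpy(stacked)   # cp.asnumpy copies GPU -> host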

Fourth, this code is incomplete. This is because the formed arrays need proper handling, as previously mentioned. But it does the major heavy-lifting.

Lastly, timing this code with timeit in Google Colab: for comparison, N = 50 takes ~50 ms (minimum) and N = 2000 takes ~7.4 s (minimum) to run.

UPDATE: Changing to parts = 40 and section = 250 brings down the minimum time to ~6.1 s.
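
For reference, here is a minimal way to run such a timing yourself (my sketch; the inputs arr and scales are assumptions about how the benchmark was set up):

import timeit
import cupy as cp

arr = cp.random.normal(size=(2000, 76, 76))
scales = cp.arange(2000, dtype=cp.float64)

def run():
    slicer(arr, scales, N=2000)
    cp.cuda.Device().synchronize()   # wait for the GPU so the timing is honest

# Best of 5 runs for the full N = 2000 case
print(min(timeit.repeat(run, number=1, repeat=5)))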

Well, I'm sure there will be better ways to write this code, and I'm looking forward to it!

amzon-ex
  • This works really well too! I'm more inclined to use `dask` than `cupy` and this code is a little more obfuscated than I'd prefer, but great approach to this problem. Thanks! – Joe Flip May 18 '20 at 19:45