Compute several FFT with GPU using Python multiprocessing and pyfft: how to avoid GPU memory leak?

Question

I am trying to implement the following pattern in Python for multi-CPU, single-GPU computation using the pycuda and pyfft packages.

I would like to have several processes (e.g. launched with multiprocessing.Pool()), with each of them able to perform FFTs using the GPU (using NVIDIA CUDA).

However, I have the following problem:

If I run too many processes or too many FFTs per process, the script hangs: it never terminates and does not compute all the requested FFTs. From further investigation, I suspect this is due to the GPU memory limit (currently 2048 MB on an NVIDIA GeForce GT 750M). It seems that the multiprocessing pool never regains control. Is there any way to avoid this?

Since each process requires less than 2048 MB, I would like to have something like a queue in which each process can book the GPU and, when a process releases its context, the next process in the queue starts using it. Is this doable?
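
To make the booking idea concrete, here is a minimal sketch of what I have in mind, not a tested solution. It shares a semaphore between the pool workers, so that at most a fixed number of them are inside the GPU section at once; with one slot, only one process uses the GPU at a time. The names max_concurrent, init_worker, gpu_task and run_all are hypothetical, and gpu_task is only a stand-in that does no GPU work. The sketch assumes that each task really releases its GPU memory before leaving the slot, which is exactly what I have not managed to confirm.

```python
import multiprocessing

_gpu_slots = None  # shared semaphore, set in each worker by init_worker()


def max_concurrent(free_mb, per_task_mb):
    """Number of tasks that fit in the given free memory (at least 1)."""
    return max(1, int(free_mb // per_task_mb))


def init_worker(slots):
    """Pool initializer: store the shared semaphore in the worker process."""
    global _gpu_slots
    _gpu_slots = slots


def gpu_task(par):
    """Hypothetical stand-in for FuncWrapper.__call__ (no GPU work here)."""
    with _gpu_slots:  # blocks until a GPU slot is free
        # Context creation, the FFT and the context release would go here.
        return par


def run_all(nffts, pool_size, slots):
    """Map gpu_task over range(nffts), with at most `slots` tasks on the GPU."""
    sem = multiprocessing.BoundedSemaphore(slots)
    pool = multiprocessing.Pool(processes=pool_size,
                                initializer=init_worker, initargs=(sem,))
    try:
        return pool.map(gpu_task, range(nffts))
    finally:
        pool.close()
        pool.join()
```

It would be called as run_all(18, 4, slots=1) from under an if __name__ == "__main__": guard. The semaphore is passed through the pool initializer because multiprocessing synchronization objects have to be handed to workers at start-up and cannot be sent as ordinary task arguments.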

Alternatively, is it possible to ensure that only one process uses the GPU at a time?
I have tried the following solutions separately, but they do not work (or perhaps I have not implemented them correctly):

1. synchronize the stream, with proc_stream.synchronize()
2. clear context cache, with pycuda.tools.clear_context_caches()
3. change the compute mode, with cuda.compute_mode = cuda.compute_mode.EXCLUSIVE

Note: solution 2 seems to free some memory, but it makes the computation much slower and does not solve the problem: when the number of FFTs to be computed is increased, the script shows the same behaviour.

Here is the code. To start with a simple task, each call computes one FFT (one could then use the batch option of execute() to compute several FFTs in a row).

import multiprocessing
import numpy as np
import pycuda.driver as cuda
import pycuda.gpuarray as gpuarray
from pycuda.tools import make_default_context
from pyfft.cuda import Plan

def main():
    # generates simple matrix, (e.g. image with a signal at the center)
    size = 4096
    center = size // 2  # integer division, so it can be used as a slice index
    in_matrix = np.zeros((size, size), dtype='complex64')
    in_matrix[center:center+2, center:center+2] = 10.

    pool_size = 4  # integer up to multiprocessing.cpu_count()
    pool = multiprocessing.Pool(processes=pool_size)
    func = FuncWrapper(in_matrix, size)
    nffts = 16  # total number of ffts to be computed
    par = np.arange(nffts)

    results = pool.map(func, par)
    pool.close()
    pool.join()

    print(results)

And here is the function wrapper:

class FuncWrapper(object):
    def __init__(self, matrix, size):
        self.in_matrix = matrix
        self.size = size
        print("Func initialized with matrix size=%i" % size)

    def __call__(self, par):
        proc_id = multiprocessing.current_process().name

        # take control over the GPU
        cuda.init()
        context = make_default_context()
        device = context.get_device()
        proc_stream = cuda.Stream()

        # move data to GPU
        # multiplication self.in_matrix*par is just to have each process computing
        # different matrices
        in_map_gpu = gpuarray.to_gpu(self.in_matrix*par)

        # create Plan, execute FFT and get back the result from GPU
        plan = Plan((self.size, self.size), dtype=np.complex64,
                    fast_math=False, normalize=False, wait_for_finish=True,
                    stream=proc_stream)
        plan.execute(in_map_gpu, wait_for_finish=True)
        result = in_map_gpu.get()

        # free memory on GPU
        del in_map_gpu

        mem = np.array(cuda.mem_get_info())/1.e6
        print("%s free=%f\ttot=%f" % (proc_id, mem[0], mem[1]))

        # release context
        context.pop()

        return par

Now, with nffts=16 and pool_size=4 the script terminates correctly and gives this output:

Func initialized with matrix size=4096
PoolWorker-1 free=1481.019392   tot=2147.024896
PoolWorker-2 free=1331.011584   tot=2147.024896
PoolWorker-3 free=1181.003776   tot=2147.024896
PoolWorker-4 free=1030.631424   tot=2147.024896
PoolWorker-1 free=881.074176    tot=2147.024896
PoolWorker-2 free=731.746304    tot=2147.024896
PoolWorker-3 free=582.418432    tot=2147.024896
PoolWorker-4 free=433.090560    tot=2147.024896
PoolWorker-1 free=582.754304    tot=2147.024896
PoolWorker-2 free=718.946304    tot=2147.024896
PoolWorker-3 free=881.254400    tot=2147.024896
PoolWorker-4 free=1030.684672   tot=2147.024896
PoolWorker-1 free=868.028416    tot=2147.024896
PoolWorker-2 free=731.713536    tot=2147.024896
PoolWorker-3 free=582.402048    tot=2147.024896
PoolWorker-4 free=433.090560    tot=2147.024896
[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
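
As a rough cross-check of these figures, the snippet below compares the size of one 4096 x 4096 complex64 input array with the drop in free memory between the first two lines of the output above. It uses the same MB convention as the script (bytes divided by 1.e6). What the remaining part of the drop is used for (context, plan or something else) is not something I have been able to determine.

```python
# Size of one 4096 x 4096 complex64 array, as used in the script above.
size = 4096
bytes_per_element = 8  # complex64 = two 32-bit floats
array_bytes = size * size * bytes_per_element
array_mb = array_bytes / 1.e6  # same convention as mem_get_info()/1.e6

# First two "free" readings from the nffts=16 run above.
free_before, free_after = 1481.019392, 1331.011584
drop_mb = free_before - free_after

print("one array: %d bytes = %.1f MB" % (array_bytes, array_mb))
print("observed drop per call: %.1f MB" % drop_mb)
print("not explained by the array: %.1f MB" % (drop_mb - array_mb))
```

So each call appears to hold on to roughly one input array plus some additional memory until the free amount starts to recover later in the run.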

But with nffts=18 and pool_size=4 the script does not terminate and gives this output, remaining stuck at the last line:

Func initialized with matrix size=4096
PoolWorker-1 free=1416.392704   tot=2147.024896
PoolWorker-2 free=982.544384    tot=2147.024896
PoolWorker-1 free=1101.037568   tot=2147.024896
PoolWorker-2 free=682.991616    tot=2147.024896
PoolWorker-3 free=815.747072    tot=2147.024896
PoolWorker-4 free=396.918784    tot=2147.024896
PoolWorker-3 free=503.046144    tot=2147.024896
PoolWorker-4 free=397.144064    tot=2147.024896
PoolWorker-1 free=531.361792    tot=2147.024896
PoolWorker-1 free=397.246464    tot=2147.024896
PoolWorker-2 free=518.610944    tot=2147.024896
PoolWorker-2 free=397.021184    tot=2147.024896
PoolWorker-3 free=517.193728    tot=2147.024896
PoolWorker-4 free=397.021184    tot=2147.024896
PoolWorker-3 free=504.336384    tot=2147.024896
PoolWorker-4 free=149.123072    tot=2147.024896
PoolWorker-1 free=283.340800    tot=2147.024896

Many thanks for your help!
