
I'm trying to run openai/whisper inference on a single model on CUDA with multiprocessing.Pool. With 6 workers, inference works fine, apart from some CUDA warnings when the worker processes exit. With 7 or more workers everything still runs, but at the end (once some workers have finished) some workers crash with "CUDA out of memory" plus the same warnings as with 6 processes.

Does anybody know what's wrong, and is there a way to solve this issue?

I'm using Ubuntu 20.04, a GeForce RTX 3060, CUDA 12, PyTorch 1.13, and openai/whisper.

When I run the program below with num_proc=6 or less, everything works fine and the processes finish their tasks:

import os
import dill
import torch
import whisper

from pathlib import Path


def whisper_transcribe(arg, lng=None):
    # arg is [[model], filename]: transcribe the file and write the text next to it
    result = arg[0][0].transcribe(arg[1], task="transcribe", language=lng)
    with open(arg[1][:arg[1].rfind('.')] + '.txt', "w") as res:
        if result["text"] != "":
            res.write(result["text"])
    return 0


def wp_audio():
    lng = "ru"
    if lng == "all":
        lng = None
    
    filenames_list = [f'videoplayback_{i}.mp3' for i in range(1, 11)]
    lng_list = ['ru'] * 10
    
    global model
    
    num_proc = 6
    
    # load the pickled large Whisper model, move it to the GPU
    # and share its CUDA tensors with the worker processes
    model = dill.load(open("wm_large.dt", 'rb'))
    model.to('cuda')
    model.share_memory()
    
    lst = []
    model_list = [[model]] * num_proc
    for i in range(num_proc):
        # each task is [[model], filename]
        lst.append([model_list[i], filenames_list[i]])
        
    torch.multiprocessing.set_start_method('spawn')
    with torch.multiprocessing.Pool(num_proc) as pl:
        pl.map(whisper_transcribe, lst)
    
    del model.encoder, model.decoder, model
    torch.cuda.empty_cache()


if __name__ == '__main__':
    wp_audio()
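
For debugging, here is a small helper (my own sketch, not part of the script above) that could be called at the top of whisper_transcribe to log how much GPU memory each worker sees, using torch.cuda.mem_get_info:

import os
import torch

def log_gpu_memory(tag=""):
    # free/total memory on the current device, and memory allocated by this process via PyTorch
    free_b, total_b = torch.cuda.mem_get_info()
    allocated_b = torch.cuda.memory_allocated()
    print(f"[pid {os.getpid()}] {tag}: free={free_b / 2**20:.0f} MiB, "
          f"total={total_b / 2**20:.0f} MiB, allocated={allocated_b / 2**20:.0f} MiB")

My expectation is that free memory shrinks as more workers run concurrently, which would match the OOM appearing only with num_proc > 6.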

Although the program works fine with num_proc=6, I get some CUDA warnings:

[W CudaIPCTypes.cpp:95] Producer process tried to deallocate over 1000 memory blocks referred by consumer processes. Deallocation might be significantly slowed down. We assume it will never going to be the case, but if it is, please file but to https://github.com/pytorch/pytorch
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

and hundreds of:

[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: invalid device ordinal (function uncheckedSetDevice)

and this at the end:

/usr/local/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
warnings.warn('resource_tracker: There appear to be %d '

But when I use num_proc greater than 6, some processes exit before their task is finished, with this:

multiprocessing.pool.RemoteTraceback: 
"""
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 125, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 48, in mapstar
    return list(map(*args))
  File "/home/user/Projekt/wp_audio_0.2.py", line 24, in whisper_transcribe
    def wp_audio():
  File "/home/user/Projekt/.venv/lib/python3.10/site-packages/whisper/transcribe.py", line 186, in transcribe
    result: DecodingResult = decode_with_fallback(segment)
  File "/home/user/Projekt/.venv/lib/python3.10/site-packages/whisper/transcribe.py", line 119, in decode_with_fallback
    decode_result = model.decode(segment, options)
  File "/home/user/Projekt/.venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/Projekt/.venv/lib/python3.10/site-packages/whisper/decoding.py", line 712, in decode
    result = DecodingTask(model, options).run(mel)
  File "/home/user/Projekt/.venv/lib/python3.10/site-packages/torch/autograd/grad_mode.py", line 27, in decorate_context
    return func(*args, **kwargs)
  File "/home/user/Projekt/.venv/lib/python3.10/site-packages/whisper/decoding.py", line 626, in run
    audio_features: Tensor = self._get_audio_features(mel)  # encoder forward pass
  File "/home/user/Projekt/.venv/lib/python3.10/site-packages/whisper/decoding.py", line 567, in _get_audio_features
    audio_features = self.model.encoder(mel)
  File "/home/user/Projekt/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/Projekt/.venv/lib/python3.10/site-packages/whisper/model.py", line 156, in forward
    x = block(x)
  File "/home/user/Projekt/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/Projekt/.venv/lib/python3.10/site-packages/whisper/model.py", line 124, in forward
    x = x + self.attn(self.attn_ln(x), mask=mask, kv_cache=kv_cache)
  File "/home/user/Projekt/.venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/user/Projekt/.venv/lib/python3.10/site-packages/whisper/model.py", line 85, in forward
    wv = self.qkv_attention(q, k, v, mask)
  File "/home/user/Projekt/.venv/lib/python3.10/site-packages/whisper/model.py", line 99, in qkv_attention
    w = F.softmax(qk.float(), dim=-1).to(q.dtype)
  File "/home/user/Projekt/.venv/lib/python3.10/site-packages/torch/nn/functional.py", line 1841, in softmax
    ret = input.softmax(dim)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 11.76 GiB total capacity; 287.76 MiB already allocated; 205.19 MiB free; 300.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/Projekt/wp_audio_0.2.py", line 60, in <module>
  File "/home/user/Projekt/wp_audio_0.2.py", line 52, in wp_audio
  File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 367, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/local/lib/python3.10/multiprocessing/pool.py", line 774, in get
    raise self._value
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 172.00 MiB (GPU 0; 11.76 GiB total capacity; 287.76 MiB already allocated; 205.19 MiB free; 300.00 MiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation.  See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
[W CudaIPCTypes.cpp:95] Producer process tried to deallocate over 1000 memory blocks referred by consumer processes. Deallocation might be significantly slowed down. We assume it will never going to be the case, but if it is, please file but to https://github.com/pytorch/pytorch
[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

with hundreds of these:

[W CUDAGuardImpl.h:46] Warning: CUDA warning: driver shutting down (function uncheckedGetDevice)
[W CUDAGuardImpl.h:62] Warning: CUDA warning: invalid device ordinal (function uncheckedSetDevice)

and this at the end:

/usr/local/lib/python3.10/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

The error with num_proc greater than 6 occurs only at the end of the program, when some processes have already finished their tasks.
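
For comparison, here is a rough sketch of the obvious fallback: loading a separate model per worker in a Pool initializer, so that no CUDA tensors are shared over IPC. This sketch assumes whisper.load_model("large") is used directly and that num_proc copies of the model actually fit into the 3060's 12 GB, which is exactly what sharing a single model is meant to avoid:

import torch.multiprocessing as mp
import whisper

_worker_model = None  # one model per worker process, loaded in the initializer

def init_worker(model_name):
    global _worker_model
    _worker_model = whisper.load_model(model_name, device="cuda")

def transcribe_file(path, lng="ru"):
    result = _worker_model.transcribe(path, task="transcribe", language=lng)
    out_path = path[:path.rfind('.')] + '.txt'
    with open(out_path, "w") as res:
        if result["text"] != "":
            res.write(result["text"])
    return out_path

if __name__ == '__main__':
    mp.set_start_method('spawn')
    files = ['videoplayback_1.mp3', 'videoplayback_2.mp3']  # etc.
    with mp.Pool(2, initializer=init_worker, initargs=("large",)) as pl:
        pl.map(transcribe_file, files)

Ideally, though, I'd like to keep the single shared model rather than hold several copies in GPU memory, which is why I'm asking about the behaviour above.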
