
I am running parallel workers using PyTorch, CUDA, and torch.multiprocessing (torch.mp), with data passed around via torch.mp Queues, Pipes, and shared_memory as appropriate. Everything appears to work, but I occasionally get a deallocation warning from CUDA, and during the program's exit procedure a "process termination before tensors released" warning also appears:

[W CudaIPCTypes.cpp:92] Producer process tried to deallocate over 1000 memory blocks 
referred by consumer processes. Deallocation might be significantly slowed down. We 
assume it will never going to be the case, but if it is, please file but to 
https://github.com/pytorch/pytorch

[W CudaIPCTypes.cpp:15] Producer process has been terminated before all shared CUDA 
tensors released. See Note [Sharing CUDA tensors]
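
For context, here is a minimal sketch of the kind of setup I'm running. This is simplified and hypothetical (the worker function, queue names, and tensor shapes are placeholders, not my actual code):

```python
import torch
import torch.multiprocessing as mp

# Simplified sketch -- my real code uses the 'spawn' start method and
# device='cuda'; here I fall back to CPU so the snippet runs anywhere.

def worker(task_q, result_q):
    # Consumer side: receive shared tensors, compute, and send results back,
    # which makes this process a producer as well.
    while True:
        t = task_q.get()
        if t is None:            # shutdown sentinel
            break
        result_q.put(t * 2)      # this result is now shared with the parent

if __name__ == '__main__':
    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    task_q, result_q = mp.Queue(), mp.Queue()
    p = mp.Process(target=worker, args=(task_q, result_q))
    p.start()
    task_q.put(torch.ones(4, device=device))   # parent acting as producer
    out = result_q.get()                       # parent acting as consumer
    task_q.put(None)                           # ask the worker to exit
    p.join()
```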

I have taken steps to give each process an opportunity to shut down gracefully, but the warnings persist. I've read in the documentation that when using CUDA, data passed through Queues must remain alive in the producer process until the consumer no longer holds it. Since data is passed back and forth, each process acts as both a producer and a consumer. Do I need to track down everything that gets shared and manually delete the consumer-side copies during the exit procedure?
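
To make the question concrete, this is roughly what I imagine such an exit procedure would have to look like in each worker. Everything here is hypothetical: `shutdown_worker` and `received_tensors` are stand-ins for whatever bookkeeping would track the shared tensors, not code I actually have:

```python
import torch

def shutdown_worker(received_tensors, result_q):
    # Hypothetical exit procedure: drop every consumer-side reference to a
    # shared CUDA tensor so the producer process can deallocate its blocks.
    for key in list(received_tensors):
        del received_tensors[key]   # release our (consumer) reference
    torch.cuda.empty_cache()        # hand cached blocks back to the driver
    result_q.close()                # no more items from this process
    result_q.join_thread()          # wait for the queue's feeder thread
```

Is something like this actually required, or is there a mechanism that releases the consumer copies automatically on clean exit?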

The above warnings are given completely without context, and despite adding various print statements to help with debugging, it is still not clear which part of the code is causing them. Is there a way to make the warnings point more clearly to the source of the problem?

Also, I have tried implementing warn_with_traceback per this Stack Overflow question, but it has no effect on these warning messages, even when added to each worker process:

import traceback
import warnings
import sys

def warn_with_traceback(message, category, filename, lineno, file=None, line=None):
    # Print the stack at the point the warning was issued, then the warning itself.
    log = file if hasattr(file, 'write') else sys.stderr
    traceback.print_stack(file=log)
    log.write(warnings.formatwarning(message, category, filename, lineno, line))

warnings.showwarning = warn_with_traceback
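
For completeness, this is roughly how I added the hook to each worker (the worker body is abbreviated, and warn_with_traceback is repeated here only so the snippet is self-contained):

```python
import sys
import traceback
import warnings

def warn_with_traceback(message, category, filename, lineno, file=None, line=None):
    # Same hook as above: print the stack, then the formatted warning.
    log = file if hasattr(file, 'write') else sys.stderr
    traceback.print_stack(file=log)
    log.write(warnings.formatwarning(message, category, filename, lineno, line))

def worker(task_q, result_q):
    # First thing each worker does: hook Python-level warnings so any
    # warning raised in this process prints a stack trace.
    warnings.showwarning = warn_with_traceback
    ...  # rest of the worker loop, omitted
```

Even with this in place, the [W CudaIPCTypes.cpp] messages appear without a traceback, which makes me suspect they are not routed through Python's warnings machinery at all.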
talonmies
Mandias