Update
I have found that this problem only occurs when validation starts and the validation dataloader is used. I would be happy to provide any additional information that would help resolve it.
I am currently running a neural network model with video inputs, and I keep getting the error messages below when I run it; however, I cannot tell which one I should actually be trying to resolve. One fix I've found is to set num_workers=0, but I would like to know if there is another workaround.
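To show what I mean, here is a minimal sketch of that workaround; the dataset, batch size, and worker count are placeholders and not my actual configuration:

    # Minimal sketch of the workaround, not my actual training code:
    # the dataset, batch size, and worker count below are placeholders.
    import torch
    from torch.utils.data import DataLoader, TensorDataset

    dummy_val_set = TensorDataset(torch.zeros(8, 3), torch.zeros(8))

    # A loader like this (worker processes are forked) is what triggers the
    # errors above for me:
    val_loader = DataLoader(dummy_val_set, batch_size=2, num_workers=4)

    # Setting num_workers=0 keeps data loading in the main process and avoids
    # the crash, at the cost of slower loading:
    val_loader = DataLoader(dummy_val_set, batch_size=2, num_workers=0)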
Here are the error messages. First, the error begins with:
/opt/conda/lib/python3.6/site-packages/tqdm/_tqdm.py:476: TqdmMonitorWarning:
tqdm:disabling monitor support (monitor_interval = 0) due to: can't start new thread
Then:
Traceback (most recent call last):
[1,3]<stderr>: File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 962, in __del__
[1,3]<stderr>: self._shutdown_workers()
[1,3]<stderr>: File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 928, in _shutdown_workers
[1,3]<stderr>: self._worker_result_queue.put((None, None))
[1,3]<stderr>: File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 87, in put
[1,3]<stderr>: self._start_thread()
[1,3]<stderr>: File "/opt/conda/lib/python3.6/multiprocessing/queues.py", line 169, in _start_thread
[1,3]<stderr>: self._thread.start()
[1,3]<stderr>: File "/opt/conda/lib/python3.6/threading.py", line 846, in start
[1,3]<stderr>: _start_new_thread(self._bootstrap, ())
[1,3]<stderr>:RuntimeError: can't start new thread
[1,0]<stderr>:Exception ignored in: <bound method _MultiProcessingDataLoaderIter.__del__ of <torch.utils.data.dataloader.
_MultiProcessingDataLoaderIter object at 0x7f792a06f470>>
And the final message that appears is:
Traceback (most recent call last):
[1,1]<stderr>: File "src/tasks/run_ag_qa.py", line 724, in <module>
[1,1]<stderr>: start_training(input_cfg)
[1,1]<stderr>: File "src/tasks/run_ag_qa.py", line 588, in start_training
[1,1]<stderr>: model, val_loader, cfg, global_step)
[1,1]<stderr>: File "/opt/conda/lib/python3.6/site-packages/torch/autograd/grad_mode.py", line 15, in decorate_context
[1,1]<stderr>: return func(*args, **kwargs)
[1,1]<stderr>: File "src/tasks/run_ag_qa.py", line 232, in validate
[1,1]<stderr>: for val_step, batch in enumerate(val_loader):
[1,1]<stderr>: File "/clipbert/src/datasets/dataloader.py", line 97, in __iter__
[1,1]<stderr>: loader_it = iter(self.loader)
[1,1]<stderr>: File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 279, in __iter__
[1,1]<stderr>: return _MultiProcessingDataLoaderIter(self)
[1,1]<stderr>: File "/opt/conda/lib/python3.6/site-packages/torch/utils/data/dataloader.py", line 719, in __init__
[1,1]<stderr>: w.start()
[1,1]<stderr>: File "/opt/conda/lib/python3.6/multiprocessing/process.py", line 105, in start
[1,1]<stderr>: self._popen = self._Popen(self)
[1,1]<stderr>: File "/opt/conda/lib/python3.6/multiprocessing/context.py", line 223, in _Popen
[1,1]<stderr>: return _default_context.get_context().Process._Popen(process_obj)
[1,1]<stderr>: File "/opt/conda/lib/python3.6/multiprocessing/context.py", line 277, in _Popen
[1,1]<stderr>: return Popen(process_obj)
[1,1]<stderr>: File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 19, in __init__
[1,1]<stderr>: self._launch(process_obj)
[1,1]<stderr>: File "/opt/conda/lib/python3.6/multiprocessing/popen_fork.py", line 66, in _launch
[1,1]<stderr>: self.pid = os.fork()
[1,1]<stderr>:BlockingIOError: [Errno 11] Resource temporarily unavailable
[1,1]<stderr>:Error in atexit._run_exitfuncs:
[1,1]<stderr>:Traceback (most recent call last):
[1,1]<stderr>: File "/opt/conda/lib/python3.6/site-packages/tqdm/_monitor.py", line 53, in exit
[1,1]<stderr>: self.join()
[1,1]<stderr>: File "/opt/conda/lib/python3.6/threading.py", line 1051, in join
[1,1]<stderr>: raise RuntimeError("cannot join thread before it is started")
[1,1]<stderr>:RuntimeError: cannot join thread before it is started
I am not sure whether the problem is that too many threads are being created or too many processes are being spawned by the dataloader; a small diagnostic check I have been running is included below. Any help is greatly appreciated!
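In case it helps with the diagnosis, this is how I have been checking the per-user process/thread limit inside the container. This is only a diagnostic sketch, and I am not certain this limit is actually what is being exhausted:

    # Diagnostic sketch only: print the limit on user processes, which on Linux
    # both os.fork() (Errno 11 / EAGAIN) and new threads ("can't start new
    # thread") count against. I am not certain this is the limit being hit.
    import resource
    import threading

    soft, hard = resource.getrlimit(resource.RLIMIT_NPROC)
    print("RLIMIT_NPROC soft=%s hard=%s" % (soft, hard))  # same value as `ulimit -u`
    print("threads currently in this process: %d" % threading.active_count())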