
Description of the problem

The error occurs whenever num_workers > 0. When I set num_workers = 0 the error disappears, but that slows training down considerably. I think the worker multiprocessing is the key factor here. How can I solve this problem?
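
For context, the DataLoader is created roughly as in the sketch below. Only the name TestImgLoader and the iteration loop come from my save_disp.py (see the traceback); the dataset class, batch size, and worker count shown here are placeholders. With num_workers set to a positive value the loop crashes as shown in the error output; with num_workers = 0 it runs.

import torch
from torch.utils.data import Dataset, DataLoader


class DummyDataset(Dataset):
    """Placeholder for the real test dataset used in save_disp.py."""

    def __len__(self):
        return 100

    def __getitem__(self, idx):
        # The real samples are image tensors; random tensors with an
        # arbitrary shape are enough to illustrate the loader setup.
        return {"left": torch.randn(3, 256, 512),
                "right": torch.randn(3, 256, 512)}


if __name__ == "__main__":
    # num_workers > 0  -> EOFError / "Bad file descriptor" in my environment
    # num_workers = 0  -> works, but much slower
    TestImgLoader = DataLoader(DummyDataset(), batch_size=4, shuffle=False,
                               num_workers=4, drop_last=False)

    for batch_idx, sample in enumerate(TestImgLoader):
        pass  # inference and saving results happen here in save_disp.py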

Environment

Docker, Python 3.8, PyTorch 1.11.0+cu113

Error output

From the multiprocessing resource-sharer thread:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 149, in _serve
    send(conn, destination_pid)
  File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 50, in send
    reduction.send_handle(conn, new_fd, pid)
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 184, in send_handle
    sendfds(s, [handle])
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 149, in sendfds
    sock.sendmsg([msg], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])
OSError: [Errno 9] Bad file descriptor

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 151, in _serve
    close()
  File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 52, in close
    os.close(new_fd)
OSError: [Errno 9] Bad file descriptor

From the main process (save_disp.py):

Traceback (most recent call last):
  File "save_disp.py", line 85, in <module>
    test()
  File "save_disp.py", line 55, in test
    for batch_idx, sample in enumerate(TestImgLoader):
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 530, in __next__
    data = self._next_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1207, in _next_data
    idx, data = self._get_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1173, in _get_data
    success, data = self._try_get_data()
  File "/opt/conda/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1011, in _try_get_data
    data = self._data_queue.get(timeout=timeout)
  File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 116, in get
    return _ForkingPickler.loads(res)
  File "/opt/conda/lib/python3.8/site-packages/torch/multiprocessing/reductions.py", line 295, in rebuild_storage_fd
    fd = df.detach()
  File "/opt/conda/lib/python3.8/multiprocessing/resource_sharer.py", line 58, in detach
    return reduction.recv_handle(conn)
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 189, in recv_handle
    return recvfds(s, 1)[0]
  File "/opt/conda/lib/python3.8/multiprocessing/reduction.py", line 159, in recvfds
    raise EOFError
EOFError