I was working with Detectron2 for object detection in Google Colab and it ran successfully, but I had to move to an HPC cluster that runs CentOS 7.4 and uses Conda. I have already installed all the requirements, and no errors appear when I run the script, but it gets stuck in an infinite sleep loop inside the resume_or_load function of the DefaultTrainer class. When I stop it, this traceback appears:
Traceback (most recent call last):
  File "new_train.py", line 138, in <module>
    trainer.resume_or_load(resume=False)
  File "/hpcfs/home/mj.patino/.local/lib/python3.7/site-packages/detectron2/engine/defaults.py", line 353, in resume_or_load
    checkpoint = self.checkpointer.resume_or_load(self.cfg.MODEL.WEIGHTS, resume=resume)
  File "/hpcfs/home/mj.patino/.local/lib/python3.7/site-packages/fvcore/common/checkpoint.py", line 215, in resume_or_load
    return self.load(path, checkpointables=[])
  File "/hpcfs/home/mj.patino/.local/lib/python3.7/site-packages/fvcore/common/checkpoint.py", line 140, in load
    path = self.path_manager.get_local_path(path)
  File "/hpcfs/home/mj.patino/.local/lib/python3.7/site-packages/iopath/common/file_io.py", line 1109, in get_local_path
    path, force=force, **kwargs
  File "/hpcfs/home/mj.patino/.local/lib/python3.7/site-packages/iopath/common/file_io.py", line 764, in _get_local_path
    with file_lock(cached):
  File "/hpcfs/home/mj.patino/.conda/envs/tesisEnv/lib/python3.7/site-packages/portalocker/utils.py", line 160, in __enter__
    return self.acquire()
  File "/hpcfs/home/mj.patino/.conda/envs/tesisEnv/lib/python3.7/site-packages/portalocker/utils.py", line 239, in acquire
    for _ in self._timeout_generator(timeout, check_interval):
  File "/hpcfs/home/mj.patino/.conda/envs/tesisEnv/lib/python3.7/site-packages/portalocker/utils.py", line 152, in _timeout_generator
    time.sleep(max(0.001, (i * check_interval) - since_start_time))
KeyboardInterrupt
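For reference, the relevant part of new_train.py looks roughly like this (a simplified sketch; the config file, dataset name and output directory below are placeholders, not my exact values):

    import os
    from detectron2 import model_zoo
    from detectron2.config import get_cfg
    from detectron2.engine import DefaultTrainer

    cfg = get_cfg()
    # config and weights come from the model_zoo (placeholder model shown here)
    cfg.merge_from_file(model_zoo.get_config_file("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml"))
    cfg.MODEL.WEIGHTS = model_zoo.get_checkpoint_url("COCO-Detection/faster_rcnn_R_50_FPN_3x.yaml")
    cfg.DATASETS.TRAIN = ("my_dataset_train",)  # hypothetical registered dataset name
    cfg.OUTPUT_DIR = "./output"
    os.makedirs(cfg.OUTPUT_DIR, exist_ok=True)

    trainer = DefaultTrainer(cfg)
    trainer.resume_or_load(resume=False)  # this is line 138, where it hangs forever
    trainer.train()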
It was very difficult to trace the error, but I found that it occurs specifically in the fcntl.flock function. When I called this function the same way Detectron2 does in Google Colab, it worked, but in my conda environment I get this error:
OSError: [Errno 9] Bad file descriptor
The error occurs when the script tries to download the pre-trained weights from the model_zoo and calls fcntl.flock() on a file on my local drive. The function receives an io.TextIOWrapper object that correctly refers to an existing file on my local drive, together with the lock flags NON_BLOCKING and EXCLUSIVE. I have already checked the file permissions and I have read and write access.
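This is roughly how I reproduced it by hand (a minimal sketch; the test file path is a placeholder, and LOCK_EX | LOCK_NB is the fcntl equivalent of portalocker's EXCLUSIVE and NON_BLOCKING flags, as far as I can tell):

    import fcntl

    test_path = "/hpcfs/home/mj.patino/flock_test.txt"  # placeholder: any existing, writable file on the local drive

    with open(test_path, "a") as fh:  # fh is an io.TextIOWrapper, like the one passed by the library
        # exclusive, non-blocking lock -- the same flags portalocker requests
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        # On Colab this succeeds; on the cluster it raises
        # OSError: [Errno 9] Bad file descriptor
        fcntl.flock(fh, fcntl.LOCK_UN)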
I have searched but haven't found an answer to why this happens. Does anyone know how I can fix this error?
Thank you so much
PS: I also tried installing Python 3.7.9, 3.7.10 and 3.9.4, and the same error occurs.