
I was working with Detectron2 for object detection in Google Colab, where it ran successfully, but I had to move to an HPC cluster that uses CentOS 7.4 and Conda. I have already installed all the requirements, and when I run the script no errors appear, but it gets stuck in an infinite sleep loop in the `resume_or_load` function of the `DefaultTrainer` class. When I interrupt it, this traceback appears:

Traceback (most recent call last):
  File "new_train.py", line 138, in <module>
    trainer.resume_or_load(resume=False)
  File "/hpcfs/home/mj.patino/.local/lib/python3.7/site-packages/detectron2/engine/defaults.py", line 353, in resume_or_load
    checkpoint = self.checkpointer.resume_or_load(self.cfg.MODEL.WEIGHTS, resume=resume)
  File "/hpcfs/home/mj.patino/.local/lib/python3.7/site-packages/fvcore/common/checkpoint.py", line 215, in resume_or_load
    return self.load(path, checkpointables=[])
  File "/hpcfs/home/mj.patino/.local/lib/python3.7/site-packages/fvcore/common/checkpoint.py", line 140, in load
    path = self.path_manager.get_local_path(path)
  File "/hpcfs/home/mj.patino/.local/lib/python3.7/site-packages/iopath/common/file_io.py", line 1109, in get_local_path
    path, force=force, **kwargs
  File "/hpcfs/home/mj.patino/.local/lib/python3.7/site-packages/iopath/common/file_io.py", line 764, in _get_local_path
    with file_lock(cached):
  File "/hpcfs/home/mj.patino/.conda/envs/tesisEnv/lib/python3.7/site-packages/portalocker/utils.py", line 160, in __enter__
    return self.acquire()
  File "/hpcfs/home/mj.patino/.conda/envs/tesisEnv/lib/python3.7/site-packages/portalocker/utils.py", line 239, in acquire
    for _ in self._timeout_generator(timeout, check_interval):
  File "/hpcfs/home/mj.patino/.conda/envs/tesisEnv/lib/python3.7/site-packages/portalocker/utils.py", line 152, in _timeout_generator
    time.sleep(max(0.001, (i * check_interval) - since_start_time))
KeyboardInterrupt

It was very difficult to trace, but I found that the error occurs specifically in the `fcntl.flock` function. When I called this function the same way Detectron2 does in Google Colab, it worked, but in my Conda environment on the cluster I get this error:

    OSError: [Errno 9] Bad file descriptor

This error occurs when the script tries to download the pre-trained weights from the model_zoo and calls `fcntl.flock()` on a file on my local drive. The function receives an `io.TextIOWrapper` object that correctly refers to an existing file on my local drive, plus the lock flags `NON_BLOCKING` and `EXCLUSIVE`. I already checked the file permissions and I have both read and write access. A minimal sketch of the call is shown below.
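For reference, this is roughly how I tested the failing call in isolation (the path is just a placeholder for an existing, writable file on the same filesystem; on Linux, portalocker's `LockFlags.EXCLUSIVE | LockFlags.NON_BLOCKING` correspond to `fcntl.LOCK_EX | fcntl.LOCK_NB`):

    import fcntl

    # Placeholder path: any existing, writable file on the same filesystem
    # where iopath keeps its download cache (under the home directory here).
    lock_path = "/hpcfs/home/mj.patino/test.lock"

    with open(lock_path, "a") as fh:  # an io.TextIOWrapper, as in portalocker
        # Exclusive, non-blocking lock -- the same flags portalocker requests.
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)
        # In Google Colab this succeeds; on the cluster it raises
        # OSError: [Errno 9] Bad file descriptor.
        fcntl.flock(fh, fcntl.LOCK_UN)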

I have searched but can't find an answer to why this happens. Does someone know how I can fix this error?

Thank you so much

P.S.: I also tried installing Python 3.7.9, 3.7.10, and 3.9.4, and the same error occurs.

  • What, specifically, is the backend filesystem? (`hpcfs` sounds like it might be something... atypical). – Charles Duffy May 15 '21 at 01:43
  • Can you use the `flock` shell command with files on that same filesystem? – Charles Duffy May 15 '21 at 01:44
  • (Also, make sure the file is actually still open at the time when the error is thrown, meaning that `/proc/self/fd/9` exists from the perspective of the Python process -- you can check that from `pdb`, for example; storing a FD number and trying to use it after the file itself has been closed is the simplest and most obvious way to get that error message, and doesn't require that there be anything weird or different with the host or filesystem). – Charles Duffy May 15 '21 at 01:45
  • ...if we ignore the OSError, hanging trying to `flock` a file typically means _something else already holds a lock on that file_. Figure out what the "something else" is -- that's what OS-level tools like `fuser` are for. – Charles Duffy May 15 '21 at 01:47
  • Yes, the filesystem is hpcfs; the cluster is pretty old, and it has a Tesla K40c GPU. – Michael Patiño May 15 '21 at 02:28
  • When I try the command `flock -w 20 filename` with a file on that filesystem, I get the error `flock: bad number`. The file is empty. – Michael Patiño May 15 '21 at 02:30
  • Now that I look into what `hpcfs` is, it's not surprising at all that this fails. You said it was a "local filesystem"; that's completely untrue. https://github.com/dailymotion/hpcfs is a network filesystem built on http shimmed into the kernel with FUSE. Of _course_ it doesn't support real filesystem semantics. – Charles Duffy May 15 '21 at 11:41
  • I see. In that case, what could I do? Does this mean I can't run the script on the cluster? – Michael Patiño May 15 '21 at 21:35
  • Patch the library to use a different, non-filesystem-based locking mechanism? Store the files on a different backend? (A sketch of the second option is below.) – Charles Duffy May 15 '21 at 23:49
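Following that last suggestion, here is a minimal sketch of the "different backend" option, assuming the compute nodes have node-local scratch space such as `/tmp` that supports `flock()`. iopath reads its cache directory from the `FVCORE_CACHE` environment variable, so redirecting it moves both the downloaded weights and the lock file off hpcfs:

    import os

    # Assumption: /tmp on the compute node is a genuinely local filesystem;
    # substitute whatever node-local scratch space the cluster provides.
    os.environ["FVCORE_CACHE"] = "/tmp/mj.patino/iopath_cache"

    # Set the variable before Detectron2 downloads anything, so iopath
    # resolves model_zoo URLs into the local cache directory instead of
    # the hpcfs-backed home directory.
    from detectron2.config import get_cfg
    from detectron2.engine import DefaultTrainer

    # Alternatively, download the weights once by hand (e.g. on a login
    # node) and point the config at the local copy, which bypasses the
    # iopath cache and its file lock entirely; the path is hypothetical:
    cfg = get_cfg()
    cfg.MODEL.WEIGHTS = "/tmp/mj.patino/model_final.pkl"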

0 Answers