
When training a neural net implemented in Keras in a screen session, I appear to be running into race conditions with Theano.

I proceed as follows. I ssh into the compute cluster I am using (on which I do not have root access).

Then I run:

screen -S model1

Then, once I'm in this screen session, I run the Python script that trains my model. I detach the screen (Ctrl+A, then D), and when I run screen -r, everything is fine. However, if I exit my ssh session before running screen -r, and then run screen -r after logging back in, I get the following error:

compilelock.py", line 91, in get_lock
  File "~/.local/lib/python2.7/site-packages/theano/gof/compilelock.py", line 275, in lock
OSError: [Errno 13] Permission denied: '~/.theano/compiledir_Linux-3.11--generic-x86_64-with-Ubuntu-13.10-saucy-x86_64-2.7.5+-64/lock_dir'
Error in sys.exitfunc:
Traceback (most recent call last):
  File "/usr/lib/python2.7/atexit.py", line 24, in _run_exitfuncs
    func(*targs, **kargs)
  File "~/.local/lib/python2.7/site-packages/theano/gof/cmodule.py", line 1344, in _on_atexit
  File "/usr/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "~/.local/lib/python2.7/site-packages/theano/gof/compilelock.py", line 54, in lock_ctx
  File "~/.local/lib/python2.7/site-packages/theano/gof/compilelock.py", line 91, in get_lock
  File "~/.local/lib/python2.7/site-packages/theano/gof/compilelock.py", line 275, in lock
OSError: [Errno 13] Permission denied: '~/.theano/compiledir_Linux-3.11--generic-x86_64-with-Ubuntu-13.10-saucy-x86_64-2.7.5+-64/lock_dir'

Does anyone know why this happens? Interestingly, it only happens when I log out and then run screen -r after logging back in.

user19346
  • Not sure why this is tagged with `cuda`, so I removed the `cuda` tag. If you think this should be tagged with `cuda`, then please explain why and re-tag. Thanks. – Robert Crovella Jul 27 '15 at 23:56

1 Answer


My guess is that your home directory is on a networked filesystem of some kind (e.g. AFS). If so, as soon as you end the session the filesystem security credentials are invalidated and the process, though it continues to run inside the screen, no longer has permission to work with files in the Theano cache directory ~/.theano. If this guess is correct then the problem is not a race condition.
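One way to test this guess is to check, from inside the screen session, whether the process can still create files in the cache directory, once before logging out and once after logging back in. The following is only a minimal sketch: the helper name can_write is hypothetical, and ~/.theano is taken from the path in the traceback. It creates a real temporary file rather than inspecting mode bits, on the assumption that mode bits alone may not reflect access on a networked filesystem such as AFS.

```python
import os
import tempfile


def can_write(directory):
    """Return True if a file can actually be created in `directory`."""
    directory = os.path.expanduser(directory)
    try:
        # Creating (and automatically deleting) a real file makes the
        # filesystem itself decide whether this process may write here.
        with tempfile.NamedTemporaryFile(dir=directory):
            pass
        return True
    except OSError:
        # Covers permission errors as well as a missing directory.
        return False


if __name__ == "__main__":
    # ~/.theano is the cache location shown in the traceback.
    print(can_write("~/.theano"))
```

If the check succeeds before the ssh session ends and fails afterwards, that would be consistent with expired credentials rather than a race condition.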

If the problem relates to AFS credential expiry then a solution is to use a credential cache with the kinit command (see the -c option in http://web.mit.edu/kerberos/krb5-1.12/doc/user/user_commands/kinit.html).

Daniel Renshaw
  • Thanks! That is correct, I am on AFS. From what I gather, I should run: 1) kinit -c cache_name me@domain.com 2) ssh me@comp.domain.com. But that doesn't seem to work (I still get the same error, and each time I ssh I still have to type in my password). Is there more to it? I've been reading the documentation, but I am not very familiar with Kerberos (especially for ssh). – user19346 Jul 28 '15 at 18:21
  • The approaches described at http://computing.help.inf.ed.ac.uk/afs-top-ten-tips#Tip07 or http://qwone.com/~jason/useful.html might help. I have the benefit of using a pre-built longjob script so don't know the details myself. – Daniel Renshaw Jul 29 '15 at 09:31