I have the following problem and I am not sure what is happening. I'll explain briefly.
I work on a cluster with several nodes managed via Slurm. All these nodes share the same disk space (I think it uses NFSv4). My problem is that, since this disk space is shared by a lot of users, there is a limit on the amount of disk space per user.
I use Slurm to launch Python scripts that run some code and save their output to a CSV file inside a folder.
Since I need more disk space than I am assigned, what I do is mount a remote folder via sshfs from a machine where I have plenty of disk, and then point the Python script to that folder via an environment variable named EXPERIMENT_PATH.
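For reference, the mount and the environment variable are set up roughly like this (the remote user, host and path below are placeholders for my actual machine):
# mount a directory of the remote machine (placeholder user/host/path) on the cluster
mkdir -p /tmp/remote_mount_point
sshfs myuser@remote-machine:/path/with/plenty/of/disk /tmp/remote_mount_point
# point the Python script at the mounted folder
export EXPERIMENT_PATH="/tmp/remote_mount_point/"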
An example of the Python script is the following:
import os

root_experiment_dir = os.getenv('EXPERIMENT_PATH')
if root_experiment_dir is None:
    root_experiment_dir = os.path.expanduser("./")
print(root_experiment_dir)

experiment_dir = os.path.join(root_experiment_dir, 'exp_dir')

## create experiment directory
try:
    os.makedirs(experiment_dir)
except:
    pass

file_results_dir = os.path.join(root_experiment_dir, 'exp_dir', 'results.csv')
if os.path.isfile(file_results_dir):
    f_results = open(file_results_dir, 'a')
else:
    f_results = open(file_results_dir, 'w')
If I launch this Python script directly, I can see the created folder and file on the remote machine whose folder I mounted via sshfs. However, if I use sbatch to launch the script via the following bash commands:
export EXPERIMENT_PATH="/tmp/remote_mount_point/"
sbatch -A server -p queue2 --ntasks=1 --cpus-per-task=1 --time=5-0:0:0 --job-name="HOLA" --output='./prueba.txt' ./run_argv.sh "python foo.py"
where run_argv.sh is a simple bash script that takes the command from its arguments and launches it, i.e. the file contains:
#!/bin/bash
$*
then I observe that nothing has been written on the remote machine. If I check the mounted folder /tmp/remote_mount_point/, nothing appears there either. Only when I unmount the remote folder with fusermount -u /tmp/remote_mount_point/ can I see that, on the local machine, a folder named /tmp/remote_mount_point/ has been created with the file inside it, but obviously nothing appears on the remote machine.
In other words, it seems that launching through Slurm bypasses the sshfs-mounted folder and creates a new local folder on the host machine, which only becomes visible once the remote folder is unmounted.
Does anyone know why this happens and how to fix it? I emphasize that this only happens if I launch everything through the Slurm manager; otherwise everything works.
I should also emphasize that all the nodes in the cluster share the same disk space, so I would guess that the mounted folder is visible from all machines.
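To double-check that guess, I suppose I could inspect the mount point from inside a job; a minimal sketch, reusing the same account/partition as in the sbatch line above and assuming findmnt is available on the nodes:
# check from a compute node whether /tmp/remote_mount_point is actually seen as a mount there
srun -A server -p queue2 --ntasks=1 bash -c 'hostname; findmnt /tmp/remote_mount_point; ls -la /tmp/remote_mount_point'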
Thanks in advance.