
I run memory-demanding simulations on an HPC cluster. I'm fitting cmdstan models with 3000 iterations for different conditions (200 unique combinations). To do this, I'm using the SimDesign package in R.

The simulations run perfectly fine, with outputs as expected, when I use a low number of replications (e.g. 10). For testing, I now wanted to run one condition row with 100 reps (this will be the real case). But after approx. 1 hour, my node runs out of memory:

sh: /rds2874z4733/temp/ no space left on device
sh: /rds2874z4733/temp/ no space left on device

When I monitor my job after cancelling it, I see that the allocated memory was not yet depleted (even though it would not have been a sufficient amount in the end):

State: CANCELLED (exit code 0)
Nodes: 1
Cores per node: 64
CPU Utilized: 3-06:35:52
CPU Efficiency: 93.73% of 3-11:51:28 core-walltime
Job Wall-clock time: 01:18:37
Memory Utilized: 33.82 GB
Memory Efficiency: 39.13% of 86.43 GB

I also tried to allocate more RAM for my node, but this does not solve the problem. As I will fit 100 cmdstan models per condition, I also tried to free memory within the fitting function, like this:

.....
  # Stan is noisy, so tell it to be more quiet()
  M3 <- quiet(mod$sample(dat,
                         refresh = 0,
                         chains = 4,
                         # parallel_chains = 4,
                         iter_warmup = n_warmup,
                         iter_sampling = n_iter,
                         adapt_delta = adapt_delta,
                         max_treedepth = max_treedepth,
                         init = init,
                         show_messages = FALSE))

  M3_hyper     <- M3$summary(c("hyper_pars", "mu_f"), mean, Mode, sd, rhat, HDInterval::hdi)
  M3_subj      <- M3$summary(c("subj_pars"), mean, sd, rhat, Mode, HDInterval::hdi)
  M3_f         <- M3$summary(c("f"), mean, sd, Mode, rhat, HDInterval::hdi)
  M3_count_rep <- M3$summary(c("count_rep"), mean)
  M3_omega     <- M3$summary("cor_mat_lower_tri", mean)

  M3_sum <- list(M3_hyper, M3_subj, M3_f, M3_count_rep, M3_omega)
  rm(M3)
  gc(full = TRUE)

  return(M3_sum)

But this does not solve the problem either. On every iteration this data is saved, and when the number of replications is reached it is summarised; this runs in parallel, as the package takes care of it. I do not save the per-iteration results, only the summarised results at the end of the simulation. As I will simulate 200 conditions with 100 reps each, I need to solve this issue either way. I will definitely run 1 or 2 conditions on different nodes, so it will be at least 2500 models for each node...
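Since the job summary above only covers RAM/CPU, here is roughly how I watch the disk side on the compute node while the job runs (a sketch; the paths are assumptions from my cluster's layout and may differ elsewhere):

```shell
# Rough disk-side checks to run on the compute node while the job is
# active (on my cluster the job dir lives under /localscratch/$SLURM_JOB_ID;
# the Rtmp* glob is where R keeps its per-session temp files):
df -h /tmp                              # free bytes on the temp filesystem
du -sh /tmp/Rtmp* 2>/dev/null || true   # growth of R's per-session temp dirs
```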

Does anybody have experience with the SimDesign package or Slurm RAM allocation and can give me some advice? I'm relatively new to coding on a cluster, so I appreciate any advice!

cheers

jan

Here is the job script for the slave conditions:

#!/bin/bash

#SBATCH -A  acc          # Account
#SBATCH -p parallel      # Partition: parallel, smp, bigmem
#SBATCH -C skylake       # architecture Skylake (64 Cores) or Broadwell (40 Cores)  
#SBATCH -n 1                     # number of tasks
#SBATCH -N 1             # allocate one full node   
#SBATCH --ramdisk=100G       # Reserve sufficient space for job on ramdisk  
#SBATCH -t 02:30:00              # Run time (hh:mm:ss)


## Default Output 
WD="/prjtdir/M3-simulations/"

## Move job to Ramdisk for sufficient space
JOBDIR="/localscratch/${SLURM_JOB_ID}/"
RAMDISK=$JOBDIR/ramdisk

module purge # ensures vanilla environment
module load lang/R # will load most current version of R

cp $WD/sim3.R $RAMDISK
cp -R $WD/Functions $RAMDISK
cp -R $WD/Models $RAMDISK

## Change Dir to Jobfolder
cd $RAMDISK

# Run Script
srun Rscript sim3.R -N $1 -K $2 -F $3 -R $4 -P $5 -I ${SLURM_JOB_ID} -D ${WD}

And here is an excerpt of sinfo - I usually use the parallel partition with 64 cores per node

sinfo -Nel -p parallel
Sun Aug 07 01:23:29 2022
NODELIST   NODES PARTITION       STATE CPUS    S:C:T MEMORY TMP_DISK WEIGHT AVAIL_FE REASON
x0001          1  parallel    drained* 64     2:16:2  88500        0      6 anyarch, RBH_OPAFM
x0002          1  parallel     drained 64     2:16:2  88500        0      6 anyarch, RBH_OPAFM
x0003          1  parallel   allocated 64     2:16:2  88500        0      6 anyarch, none
x0004          1  parallel     drained 64     2:16:2  88500        0      6 anyarch, SlurmdSpoolDir is fu
x0005          1  parallel   allocated 64     2:16:2  88500        0      6 anyarch, none
x0006          1  parallel   allocated 64     2:16:2  88500        0      6 anyarch, none
x0007          1  parallel   allocated 64     2:16:2  88500        0      6 anyarch, none
x0008          1  parallel   allocated 64     2:16:2  88500        0      6 anyarch, none
x0009          1  parallel   allocated 64     2:16:2  88500        0      6 anyarch, none
x0010          1  parallel   allocated 64     2:16:2  88500        0      6 anyarch, none

Here is the actual error I get after approx. 40-60 minutes (depending on the condition):

Design row: 1/1;   Started: Sun Aug  7 00:44:31 2022;   Total elapsed time: 0.00s 
sh: /tmp/RtmpyFfzBI/file9eed87b8b8e1c: No space left on device
  • Are you saving the results to a file somewhere? The error "no space left on device" sounds like you are running out of room on your file storage, not RAM. – MrFlick Jul 31 '22 at 19:04
  • Hey, no I don't save the results; but even if I did, the job summary indicates that there is sufficient space at the time the msgs appear :/ – Jan Göttmann Jul 31 '22 at 19:25
  • The job summary only indicates how much RAM/CPU was available. It does not tell you anything about file storage space. It's possible that particular package creates temp files on disk that are filling up available storage space. – MrFlick Jul 31 '22 at 19:28
  • That's probably true; I may check some alternative to route the output to a special scratch partition with more space. But it's weird anyway, because I only create a list with 5 data frames of dim 100x5 per iteration; this should not take up so much space.. – Jan Göttmann Jul 31 '22 at 19:45
  • Can you please check `myquota`? – Prakhar Sharma Aug 01 '22 at 10:29
  • Thanks for the answers! Indeed the job created a huge number of temp files. I moved the job to the scratch partition of the cluster, which provides sufficient space for such operations, and that solved the problem! – Jan Göttmann Aug 02 '22 at 09:36
  • OK, the problem persists. @Prakhar, my quotas are OK; in the project dir we have 1 TB of space! I ran one simulation on a weekend, and everything was fine. Now I extended the simulation (just another condition), nothing that increases the memory demands per iteration, and the problem is here again. I debugged the complete script line by line; the script is OK! I'm really desperate. I also moved to the scratch partition, with no effect. I can post the job scripts if that would help.. – Jan Göttmann Aug 06 '22 at 15:59
  • Yes, please post the job script as well as the `sinfo` output. – Prakhar Sharma Aug 06 '22 at 21:11
  • I added the job script and an `sinfo` output. I got 100 replications to run with this config. But as soon as I increase the number of draws or fit more data, even 1 TB of ramdisk is not sufficient. The script runs fine with 15, 30, 50, 75 replications without specifying any memory. – Jan Göttmann Aug 06 '22 at 23:30
  • Good intuition, @MrFlick. The `quiet()` function was indeed piping output to a `tempfile` connection. A suitable patch has been made to address this in `SimDesign`. – philchalmers Aug 14 '22 at 03:46

2 Answers


I was able to fix the problem by also defining TMPDIR on the scratch space:

## Move job to scratch space for sufficient room
JOBDIR="/localscratch/${SLURM_JOB_ID}/"
export TMPDIR=$JOBDIR   # export it so srun/Rscript child processes see it

module purge # ensures vanilla environment
module load lang/R # will load most current version of R

cp $WD/sim3.R $JOBDIR
cp -R $WD/Functions $JOBDIR
cp -R $WD/Models $JOBDIR

## Change Dir to Jobfolder
cd $JOBDIR

# Run Script
srun Rscript sim3.R -N $1 -K $2 -F $3 -R $4 -P $5 -I ${SLURM_JOB_ID} -D ${WD}

It now runs with double the iterations in every condition, without needing additional space.
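A quick way to sanity-check that the redirection reaches child processes (a sketch; `mktemp` honors `TMPDIR` the same way R's `tempdir()` does, and `/tmp/demo-jobdir` stands in here for the real `/localscratch` job directory):

```shell
# Stand-in check; on the cluster JOBDIR would be /localscratch/${SLURM_JOB_ID}/
export TMPDIR="${JOBDIR:-/tmp/demo-jobdir}"
mkdir -p "$TMPDIR"

f=$(mktemp)   # mktemp honors TMPDIR, just like R's tempdir() does
case "$f" in
  "$TMPDIR"*) echo "OK: temp files land in $TMPDIR" ;;
  *)          echo "TMPDIR not honored: $f" ;;
esac
```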

  • I don't think you need the `module purge`. The job runs on a fresh compute node, so there is no chance that a module is already loaded, unless you were already using the same compute node in the same job session to load other modules. – Prakhar Sharma Aug 09 '22 at 08:48
  • Nice fix. Hopefully it will not be necessary in the next release of `SimDesign`. – philchalmers Aug 15 '22 at 15:51
  • Ah, surprising that this behavior is related to the package; I thought it was related to Stan. Have you patched it already? – Jan Göttmann Aug 16 '22 at 18:35
  • @JanGöttmann Yes, the update is currently on the GitHub dev branch. And yes, it's partially related to Stan as well, in that Stan's verbose console output is redirected to a unique tempfile via `sink()`, which really should have been deleted immediately to save space in distributed jobs (which is the new behavior). – philchalmers Aug 18 '22 at 02:27
  • I ran a simulation with the fixed version without the TMPDIR workaround; it works as expected. It also seems to be way more memory efficient in general! @philchalmers – Jan Göttmann Aug 26 '22 at 07:36
  • @JanGöttmann Great to hear, and thanks for using the package. – philchalmers Aug 27 '22 at 13:41

I see this question has been answered earlier, but I felt I could add the underlying reason for this error. Every file your script creates consumes an entry in the filesystem's metadata table, called an inode. Each filesystem has a fixed maximum number of inodes, and once you exhaust that number you get the error message "No space left on device", even if there are still free bytes on the disk. You can overcome this, as you have done, by redirecting the temp-file writing to a different destination, or by entirely removing the step that spews out lots of small files (1 kB or even 0 kB files, perhaps). You could also buy more time and make the error message vanish by simply increasing the number of inodes accommodated by your filesystem.
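As a generic illustration (not specific to this cluster), `df -i` exposes the inode side that `df -h` hides:

```shell
# "No space left on device" despite free bytes usually means the
# filesystem has run out of inodes; compare byte usage with inode usage:
df -h /tmp    # byte usage (Size / Used / Avail)
df -i /tmp    # inode usage (Inodes / IUsed / IFree / IUse%)
```

When `IUse%` hits 100%, writes fail exactly as they would on a full disk.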

  • Thanks for your comment on this; unfortunately I can't modify the inode number on the cluster, we are restricted to 1e6 files in the project directories. But this adds some more clarity to the topic, thanks! – Jan Göttmann Dec 01 '22 at 15:05