
Background: I am using the adehabitatHR package to create utilization distribution estimates (UDEs) for wildlife populations.

I have a list of UDEs that can be converted into SpatVector objects (and from there into shapefiles) using vect() from the 'terra' package. See lines 657-705 for my current workflow.

If I were to do this on my local machine:

spatvec_UDEs <- lapply(UDEs, function(x){vect(x)})

I have thousands of UDEs to calculate and have used the parallel package to calculate the UDEs for some animals (e.g., winter range in one study area, 2002-2021), but I have been unable to get that result saved and exported using the ACENET Siku cluster.

Notes from my work, my progress, and a minimal reproducible example are included below. My question is about how memory is mapped in my SLURM requests. I am seeking a working example of parLapply(), mclapply(), or future::plan(multicore) that can help me understand how the memory is being used and how to get the parallel function operating on either a single core or multiple cores. I believe multiple cores are needed in my case to obtain the memory required to run the adehabitatHR functions on my data.

My understanding is as follows: I call a function to process elements in a list of adehabitatHR estUD objects. The memory requirement for processing grows to about 2.5G per element, and I am running about 200 tasks in my foreach loop. I am not sure I am using "task" correctly here, but I think of each iteration through the loop that does a single calculation as one task (e.g., 1+i would be one task if i=1, two tasks if i=2, ...). I believe that one node means I will have access to ~720G, which is Siku's per-node memory. If each estUD creation task needs ~4G, then I should have sufficient room to run 180 tasks (720G / 4G = 180).

For example, does R have to be initialized on each node in interactive mode, or does requesting two nodes result in shared memory (e.g., 720G x 2 nodes = 1440G)? I am confused about distributed versus shared memory and how it works across SLURM interactive and batch job submissions on the Siku HPC.
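To make my mental model concrete, here is how I currently picture the two modes in plain R (a toy sketch, not my real workflow; please correct me if the model is wrong): forked workers (mclapply) share the parent's memory on a single node via copy-on-write, while PSOCK workers (parLapply) are separate R processes that each receive their own copy of any exported object.

```r
library(parallel)

# Forked workers share the parent's memory (one node only, copy-on-write):
# `big` is not duplicated unless a worker modifies it.
big <- runif(1e6)
res_fork <- mclapply(1:4, function(i) sum(big) + i, mc.cores = 2)

# PSOCK workers are independent R processes: `big` must be exported,
# i.e. copied into each worker, multiplying the memory footprint.
cl <- makeCluster(2, type = "PSOCK")
clusterExport(cl, "big")
res_sock <- parLapply(cl, 1:4, function(i) sum(big) + i)
stopCluster(cl)
```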

I was using interactive job submissions to learn and here are my notes:

#BRB1: Results in an endlessly repeating "Selection:" prompt for
#mclapply

    salloc --time=3:00:0 --ntasks=1 --cpus-per-task=8 --mem-per-cpu=23875M

#BRB2: Elapsed 4619 (1h16); "not a valid cluster" error from parLapply()

    salloc --time=3:00:0 --ntasks=1 --cpus-per-task=16 --mem-per-cpu=11937M

#BRB3: Elapsed 4700 (slower than before). Here N=16, where N is the number
#of (OpenMP) parallel threads set by --cpus-per-task. Adding cl=cl in
#parLapply() failed.

    salloc --time=3:00:0 --nodes=1 --ntasks=1 --cpus-per-task=16 --mem-per-cpu=11937M

#BRB4: Elapsed 4262.420. Fastest, but parLapply() still failed.

    salloc --time=3:00:0 --nodes=1 --cpus-per-task=20 --mem-per-cpu=11937M

#BRB5: Elapsed 4565.692, so the extra mem-per-cpu did not make it faster exactly.
#Attempted:
BRBs_BM_WIv <- slurm_map(BRBs_BM_WI,
                         function(x) {vect(x)},
                         nodes = 1,
                         cpus_per_node = 1)
#Result:
#"sbatch: error: Batch job submission failed: Requested partition
#configuration not available now."

    salloc --time=3:00:0 --nodes=1 --cpus-per-task=20 --mem-per-cpu=18500M
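For the record, the same interactive request can also be written as a batch submission script (a sketch; the module name and the script name `brb_udes.R` are placeholders that may differ on Siku):

```shell
#!/bin/bash
#SBATCH --time=3:00:00
#SBATCH --nodes=1
#SBATCH --cpus-per-task=20
#SBATCH --mem-per-cpu=18500M

module load r                  # module name may differ on Siku
Rscript --vanilla brb_udes.R   # placeholder for the actual analysis script
```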

This is a minimal reproducible example where I attempt to save the result as an .Rds file:

#BRB6: 

salloc --time=3:00:0 --nodes=2 --cpus-per-task=20 --mem-per-cpu=18500M

#For BRB6 using min code example. Example dataset is available
#in the adehabitatHR package and presented here:

library('adehabitatHR')
library('adehabitatLT')
library('foreach')
library('here')        # used below for output paths

data(puechcirc)
Traj_li <- list(puechcirc[1], puechcirc[2], puechcirc[3])
DLik <- c(2.1, 2.2, 4)

# NB: %dopar% needs a registered backend (e.g., doParallel::registerDoParallel());
# without one, foreach warns and falls back to sequential execution.
system.time({
BRBs_PC <- foreach(i = 1:length(Traj_li),
                   .combine = c,
                   .packages = c("adehabitatHR", "adehabitatLT", "terra")) %dopar% {
                       BRB(Traj_li[[i]][1],
                           D = DLik[i],
                           Tmax = 1500*60,
                           Lmin = 2,
                           hmin = 20,
                           type = "UD",
                           grid = 4000)
                   }
})

thenames <- c("pn1", "pn2", "pn3")
# NB: saveRDS() takes a single file path, but paste0() below builds a
# vector of three paths; each element really needs its own name.
WRds <- function(x) {saveRDS(x,
                             paste0(here("BRB_UDs"), "/",
                                    thenames, ".Rds"))
}
BRBs_PC_saved <- slurm_map(BRBs_PC, f = WRds, nodes = 2, cpus_per_node = 1)

The minimal reproducible BRB6 result (2 nodes):

sbatch: error: Batch job submission failed: Requested partition configuration not available now

Error in strsplit(sys_out, " ")[[1]] : subscript out of bounds
In addition: Warning message:
In system("sbatch submit.sh", intern = TRUE) :
  running command 'sbatch submit.sh' had status 1

These are some of the approaches I have tried as I read and learn about parallel processing on a cluster:

mclapply(X = BRBs_BM_WIv, FUN=WV, mc.cores = n.cores)

my.cluster <- parallel::makeCluster(
  n.cores, 
  type = "FORK"
  )
doParallel::registerDoParallel(cl = my.cluster)
WVec <- function(x) {vect(x)}
BRBs_BM_WIv <- parLapply(cl = my.cluster,
                         BRBs_BM_WI,
                         fun = WVec)   # note: parLapply's argument is 'fun'
stopCluster(my.cluster)


my.cluster <- parallel::makeCluster(
  n.cores, 
  type = "FORK"
  )
doParallel::registerDoParallel(cl = my.cluster)
WV <- function(x) {writeVector(x,
                               paste0(here("BRB_UDs"), "/",
                                      thenames, ".shp"),
                               overwrite = TRUE)}
parLapply(cl = my.cluster, X = BRBs_BM_WIv, fun = WV)
stopCluster(my.cluster)

I have been reading extensively on this topic (for example), but have been unable to find specific examples that can help me with the issues I am facing.

  • I am reading and trying to learn, but the resource material on this topic seems to be scattered across many places. Curious to know if there is a single, consolidated resource: a good book that would help me understand how R communicates and works across the cluster. This article: https://cran.r-project.org/web/packages/slurmR/vignettes/working-with-slurm.html helped with the definitions, but "task" is still a little unclear to me. – Mark Thompson Jun 20 '23 at 00:48
  • Learning more: my SLURM request in BRB6 was not correct for this job. I changed to --nodes=2 --cpus-per-task=as.vector(future::availableCores()) --mem=0. However, I get the same error with slurm_map(), even when setting cpus_per_node=as.vector(future::availableCores()). I have checked the configuration using scontrol show nodes and everything matches up. I cannot get parallel functions to work on >1 node. – Mark Thompson Jun 20 '23 at 16:10

2 Answers


I received some help and have a partial answer. One source of confusion is that "CPU" in SLURM terminology refers to a core, so --cpus-per-task=N requests N cores. Ensuring that I worked on a single node produced a result with the following SLURM request:

salloc --time=3:00:0 --nodes=1 --cpus-per-task=40 --mem=0
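One thing that helped me verify what the allocation actually gave me: inside an salloc or sbatch session, SLURM exposes the allocation through environment variables (a sketch; outside a job these variables are simply unset):

```shell
# Print what SLURM actually allocated to this job; the fallbacks fire
# when the script is run outside a SLURM job.
echo "CPUs per task: ${SLURM_CPUS_PER_TASK:-not in a job}"
echo "Nodes:         ${SLURM_JOB_NUM_NODES:-not in a job}"
echo "Node list:     ${SLURM_JOB_NODELIST:-not in a job}"
nproc   # cores visible to the current shell
```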

Using the minimal example code in my question, the following was able to produce the first result I was after:

n.cores <- future::availableCores() - 1
my.cluster <- parallel::makeCluster(
  n.cores, 
  type = "PSOCK"
  )

#register it to be used by %dopar%
doParallel::registerDoParallel(cl = my.cluster)

system.time({
BRBs_BM_WI <- foreach(i = 1:length(Traj_li),
                      .combine = c,
                      .packages = c("adehabitatHR","adehabitatLT", "here")) %dopar% {
                            saveRDS(BRB(Traj_li[[i]][1],
                            D = DLik[i],
                            Tmax = 1500*60,
                            Lmin = 2,
                            hmin = 20,
                            type = "UD",
                            grid = 4000),paste0(here("BRB_UDs"),"/",thenames[i],"toy.Rds"))
                        }})
stopCluster(my.cluster)

Note that I have a vector called thenames that was created to address a naming struggle I have had with the adehabitatHR estUD (BRB) objects (see here). Also note that I used future::availableCores() to get the number of cores available to the job, which is not the same as parallel::detectCores(). The next step in my process takes the calculated objects, estimates the home-range size, and converts them into other types of spatial objects. I was able to use the foreach approach above, but found that the future package approach (see here and here) was faster:

### future approach, works fastest: 311 sec
library(future)
library(future.apply)
plan(multicore)
BRBs_BM_WI <- lapply(BRBs_BM_WIf, function(x) {readRDS(x)})

system.time(BRBs_BM_WIv <- future_lapply(BRBs_BM_WI, FUN = function(x) {getverticeshr.estUD(x)}))
name_burst <- list()
BRB_area <- data.frame(id = character(), year = character(),
                       area = numeric(), nb.reloc = integer())
system.time(for (i in 1:length(thenames)) {
    saveRDS(BRBs_BM_WIv[[i]], paste0(here("BRB_UDs"), "/", thenames[i], "toy_hr.Rds"))
    homerangedf <- as.data.frame(BRBs_BM_WIv[[i]])
    name_burst[[i]] <- adehabitatLT::burst(Traj_li[[i]][1])
    BRB_area[i, c(1:4)] <- rbind(data.frame(id = name_burst[[i]],
                                            year = substr(names(Traj_li[i]), 7, 11),
                                            area = homerangedf[, 2],
                                            nb.reloc = nrow(Traj_li[[i]][[1]])))
})


## parallel approach, 370.487 sec
n.cores <- as.vector(future::availableCores()) - 1
my.cluster <- parallel::makeCluster(
  n.cores, 
  type = "PSOCK"
  )

#register it to be used by %dopar%
doParallel::registerDoParallel(cl = my.cluster)

BRBs_BM_WI <- lapply(BRBs_BM_WIf, function(x){readRDS(x)})

system.time({
homerange <- foreach(i = 1:length(BRBs_BM_WI),
                      .combine = c,
                      .packages = c("adehabitatHR","here")) %dopar% {
                      saveRDS(getverticeshr.estUD(BRBs_BM_WI[[i]]),
                              paste0(here("BRB_UDs"),"/",thenames[i],"toy_hr.Rds"))
                        }    
    })
# Stop the parallel backend
stopCluster(my.cluster)

The remaining issue is that this works on my minimal example but not on my actual data. I run into memory problems when calculating getverticeshr.estUD() or getvolumeUD(), which either produce a killed job or:

Error in mcfork(detached) :
  unable to fork, possible reason: Cannot allocate memory

My understanding of --mem=0 is that it reserves all available memory on the node (see here). I do not know how to solve the memory issues and think I might need to spread this over multiple nodes to achieve the functionality I require. It would be helpful to have additional guidance on how to allocate memory and processing tasks more dynamically for these types of wildlife analyses.
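One way I could bound memory without a second node (a sketch, assuming `BRBs_BM_WIf` holds the saved .Rds paths and `thenames` the matching output names, as above): trade parallelism for memory by handling one estUD at a time and releasing it before the next.

```r
library(adehabitatHR)
library(here)

# Process one saved estUD at a time so only one ~2.5G object is ever
# resident; slower than parLapply, but the footprint stays bounded.
for (i in seq_along(BRBs_BM_WIf)) {
    ud <- readRDS(BRBs_BM_WIf[i])
    hr <- getverticeshr.estUD(ud)
    saveRDS(hr, paste0(here("BRB_UDs"), "/", thenames[i], "_hr.Rds"))
    rm(ud, hr)
    gc()   # return freed memory promptly before the next element
}
```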


I am not sure I understand what you are asking (the question is long and involved), but it seems that your main question is how to run R in parallel using SLURM. You seem to focus on using multiple CPUs, whereas I would first make sure to use multiple nodes.

I do these things like this (there may be better ways):

Create a file bashR.sh with these contents (or something similar that works on your system):

#!/bin/bash -l

module load R
Rscript --vanilla ${1} ${2} ${3}

Write an R script (here test.R) with this general form:

fun <- function(x, parameter) {
    print(x)
    print(x * parameter)
    print("done")
}

arg <- as.integer(commandArgs(trailingOnly=TRUE)[1])
i <- as.numeric(Sys.getenv("SLURM_ARRAY_TASK_ID"))

n <- 10 
if (i <= n) {
    fun(i, arg)
} else {
    print("all done")
}

Here i represents a case that can be processed in parallel (independently from all other cases). You probably do not need the argument(s) arg.

Now run the R script using sbatch:

sbatch --array=1-10 --time=60 --mem=32192 bashR.sh test.R 42
#Submitted batch job 66585267

You may want to add parameters such as -p (partition) and change the time and memory requests. The --array argument is like a for-loop iterator (in this example, in R speak, for (i in 1:10)).

The output will look something like this; your filename (job number) will of course be different.

cat slurm-66585267_5.out
#==========================================
#SLURM_JOB_ID = 66585272
#SLURM_NODELIST = cpu-11-97
#==========================================
#Unloading openmpi/4.1.5
#Unloading slurm/22.05.8
#Loading slurm/22.05.8
#Loading openmpi/4.1.5
#Loading conda/base/latest
#
#Loading R/4.2.3
#  Loading requirement: conda/base/latest
#[1] 5
#[1] 210
#[1] "done"

And, at some point, remove the SLURM log files:

rm slurm-*

If this is what you are looking for, I would implement it in baby steps. First change the function fun to make sure that it identifies the right case, finds the input data, and can write some output data.
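For example, adapted to your home-range workflow, one array task could process one trajectory (a sketch; the input/output paths and the D value are assumptions you would replace with your own):

```r
# One array task = one trajectory; SLURM_ARRAY_TASK_ID selects the case.
i <- as.numeric(Sys.getenv("SLURM_ARRAY_TASK_ID"))

library(adehabitatHR)
traj <- readRDS(sprintf("input/traj_%03d.Rds", i))   # assumed input layout
ud <- BRB(traj, D = 2.1, Tmax = 1500 * 60, Lmin = 2, # D value is a placeholder
          hmin = 20, type = "UD", grid = 4000)
saveRDS(ud, sprintf("output/ud_%03d.Rds", i))        # assumed output layout
```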

Robert Hijmans
  • Thank you! I do not fully understand what you have posted here, but I will study it, and I am working with others who can assist. Sorry for the long question. Partly, I was not sure what to ask because I do not understand the nature of the problem beyond insufficient local memory and the sense that more nodes will help. Biologist trying to do programming. – Mark Thompson Jun 19 '23 at 03:38