
I got access to an HPC cluster with an MPI partition and I'm trying to submit a job via SLURM using several nodes. I'm new to supercomputers and to MPI, and this is also my first Stack Overflow question.

The R sample script that I'm using looks like this:

library(snow)
library(doSNOW)
library(foreach)
library(Rmpi)

# Spawn 3 MPI workers (one of the 4 slots stays with the master process)
cl <- makeCluster(4 - 1, type = "MPI")
registerDoSNOW(cl)
np <- getDoParWorkers()
np

# Generate 1000 random 1000 x 1000 matrices; foreach returns them as a list
lista <- foreach(i = 1:1000) %dopar% {
  matrix(rnorm(1000 * 1000), nrow = 1000, ncol = 1000)
}

# Invert each matrix on the workers
listainv <- foreach(i = 1:1000) %dopar% {
  solve(lista[[i]])
}

listainv[[1]]

stopCluster(cl)
mpi.quit()

And my batch file is the following:

#!/bin/bash -l
#SBATCH --job-name=MyR
#SBATCH --partition=rome --mem=0 --time=1-24:00:00
#SBATCH --output=./Results/Results_from_jobs/final_b.out
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2

MyRProgram="./test_scripts/sample_code_matrix.R"

module use /opt/pkg/ITER/modules/all
module load R

# Launch a single R master; Rmpi spawns the workers from makeCluster().
# --oversubscribe is an mpirun option, so it goes before the R command.
mpirun -np 1 --oversubscribe R --vanilla -f "$MyRProgram"
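As a side note, instead of hard-coding 4 - 1 in the R script, the worker count could be derived from the SLURM allocation. A minimal sketch, assuming the standard SLURM_NTASKS environment variable that sbatch sets inside the allocation:

# Derive the worker count from the allocation (falls back to 4 if unset)
ntasks <- as.integer(Sys.getenv("SLURM_NTASKS", unset = "4"))
cl <- makeCluster(ntasks - 1, type = "MPI")  # one slot stays with the master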

When I submit it via SLURM, the job does start on the nodes, but I get the following output with an error:

> library(snow)
> library(doSNOW)
Loading required package: foreach
Loading required package: iterators
> library(foreach)
> library(Rmpi)
> cl<- makeCluster(4-1, type="MPI")
[cn49.hpc:19074] pml_ucx.c:178  Error: Failed to receive UCX worker address: Not found (-13)
[cn49.hpc:19074] [[24597,2],2] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
[cn49.hpc:19073] pml_ucx.c:178  Error: Failed to receive UCX worker address: Not found (-13)
[cn49.hpc:19073] [[24597,2],1] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_dpm_dyn_init() failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[cn49:19074] *** An error occurred in MPI_Init
[cn49:19074] *** reported by process [1611988994,2]
[cn49:19074] *** on a NULL communicator
[cn49:19074] *** Unknown error
[cn49:19074] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[cn49:19074] ***    and potentially your MPI job)
[cn48.hpc:38141] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[cn48.hpc:38141] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[cn48.hpc:38141] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle

Any idea what's going on?

When writing this code I was inspired by this post. Thank you very much!

  • Cluster spawning with `makeCluster` was developed for clusters of workstations or laptops rather than HPC clusters, where SPMD style programming with SLURM resource allocation is prevalent. Take a look at https://stackoverflow.com/a/73793344/4103425, where an SPMD approach in R is shown. – George Ostrouchov Sep 24 '22 at 05:13
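To illustrate the SPMD approach suggested in this comment, here is a minimal sketch using the pbdMPI package (a sketch only, assuming pbdMPI is installed on the cluster; it follows the pattern in the linked answer). Every rank runs the same script and works on its own slice of the 1000 matrices, so no processes are spawned at runtime:

library(pbdMPI)
init()

rank <- comm.rank()  # this rank's id, 0 .. comm.size() - 1
size <- comm.size()  # total number of MPI ranks

# Each rank generates and inverts its own share of the 1000 matrices
my_ids <- seq(rank + 1, 1000, by = size)
my_inv <- lapply(my_ids, function(i) {
  solve(matrix(rnorm(1000 * 1000), nrow = 1000, ncol = 1000))
})

# Rank 0 prints a small sanity check
comm.print(my_inv[[1]][1:3, 1:3], rank.print = 0)

finalize()

Such a script is launched directly, e.g. with mpirun -np 4 Rscript script.R (or with srun), so all ranks exist from the start and MPI_Comm_spawn is never called.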

0 Answers