I got access to an HPC cluster with an MPI partition, and I'm trying to submit a job via SLURM that uses several nodes. I'm new to supercomputers and to MPI, and this is also my first Stack Overflow question.
The sample R script I'm using looks like this:
library(snow)
library(doSNOW)
library(foreach)
library(Rmpi)

# The batch file requests 4 MPI slots; one is taken by the master, so spawn 3 workers
cl <- makeCluster(4 - 1, type = "MPI")
registerDoSNOW(cl)
np <- getDoParWorkers()
np

# build a list of 1000 random 1000 x 1000 matrices
lista <- list()
foreach(i = 1:1000) %do% {
  lista[[i]] <- matrix(rnorm(1000 * 1000), nrow = 1000, ncol = 1000)
}

# invert each matrix
listainv <- list()
foreach(i = 1:1000) %do% {
  listainv[[i]] <- solve(lista[[i]])
}
listainv[[1]]
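For reference, my understanding (which may well be wrong) is that the worker count is supposed to match the slots SLURM allocates rather than being hardcoded, roughly along these lines (mpi.universe.size() comes from Rmpi; treat this as a sketch, I haven't verified it on this cluster):

library(Rmpi)
library(snow)
library(doSNOW)

# one slot is taken by the R master that mpirun starts,
# so spawn one worker per remaining slot in the MPI universe
n_workers <- mpi.universe.size() - 1
cl <- makeCluster(n_workers, type = "MPI")
registerDoSNOW(cl)

# ... parallel work here ...

stopCluster(cl)   # shut down the spawned workers
mpi.quit()        # terminate MPI before R exits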
And my batch file is the following:
#!/bin/bash -l
#SBATCH --job-name MyR
#SBATCH --partition=rome --mem=0 --time=1-24:00:00
#SBATCH --output=./Results/Results_from_jobs/final_b.out
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=2
MyRProgram="./test_scripts/sample_code_matrix.R"
module use /opt/pkg/ITER/modules/all
module load R
mpirun -np 1 --oversubscribe R --vanilla -f "$MyRProgram"
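In case it is relevant, a minimal check script like the one below (using only Rmpi's mpi.universe.size() and base R) should show how many MPI slots the master process actually sees under this allocation; I mention it only as a possible diagnostic and haven't dug into its output yet:

library(Rmpi)
# with --nodes=2 and --ntasks-per-node=2 I would expect this to report 4
cat("MPI universe size:", mpi.universe.size(), "\n")
cat("master running on:", Sys.info()[["nodename"]], "\n")
mpi.quit()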
When I submit it through SLURM, the job does reach the nodes, but I get the following output with an error:
> library(snow)
> library(doSNOW)
Loading required package: foreach
Loading required package: iterators
> library(foreach)
> library(Rmpi)
> cl<- makeCluster(4-1, type="MPI")
[cn49.hpc:19074] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
[cn49.hpc:19074] [[24597,2],2] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
[cn49.hpc:19073] pml_ucx.c:178 Error: Failed to receive UCX worker address: Not found (-13)
[cn49.hpc:19073] [[24597,2],1] ORTE_ERROR_LOG: Error in file dpm/dpm.c at line 493
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):
ompi_dpm_dyn_init() failed
--> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
[cn49:19074] *** An error occurred in MPI_Init
[cn49:19074] *** reported by process [1611988994,2]
[cn49:19074] *** on a NULL communicator
[cn49:19074] *** Unknown error
[cn49:19074] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[cn49:19074] *** and potentially your MPI job)
[cn48.hpc:38141] 1 more process has sent help message help-mpi-runtime.txt / mpi_init:startup:internal-failure
[cn48.hpc:38141] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
[cn48.hpc:38141] 1 more process has sent help message help-mpi-errors.txt / mpi_errors_are_fatal unknown handle
Any idea what's going on?
For writing this code I was inspired by this post. Thank you very much!