
I am a graduate student (master's) and I run my simulations with an in-house code that uses MPI. I used to run it with OpenMPI on a supercomputer we had access to, and since that machine was shut down I have been trying to switch to another supercomputer that has Intel MPI installed. The problem is that the same code which worked perfectly fine before now hits what looks like a memory leak after a set number of iterations (time steps). Since the code is relatively large and my knowledge of MPI is very basic, it is proving very difficult to debug. So I installed OpenMPI on the new supercomputer, but it gives the following error message upon execution and then terminates:

Invalid number of PE Please check partitioning pattern or number of PE

NOTE: The error message is repeated as many times as the number of nodes I used to run the case (here, 8). The code was compiled with mpif90, using -fopenmp for thread parallelisation.

There is of course no guarantee that running it with OpenMPI will avoid the memory leak, but I feel it is worth a shot, since the code ran perfectly fine with OpenMPI earlier.
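
As far as I can tell, "Invalid number of PE" is printed by the in-house code itself (PE = processing element), apparently from a check that the number of MPI ranks matches its partitioning pattern, rather than by OpenMPI. One thing I want to rule out is a mismatch between the mpif90 I compiled with and the mpirun I launch with, since (as I understand it) that can leave every rank seeing a communicator of size 1 and trip exactly this kind of check. Below is a minimal test program I wrote for this; it is my own sketch, not part of the in-house code:

    ! check_pe.f90 -- my own sanity check, not part of the in-house code.
    ! If the binary and the launcher come from different MPI installations,
    ! every rank may report "of 1" instead of "of 8", which would explain
    ! a partitioning check rejecting the number of PEs.
    program check_pe
      use mpi
      implicit none
      integer :: ierr, rank, nprocs

      call MPI_Init(ierr)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, nprocs, ierr)
      print *, 'rank ', rank, ' of ', nprocs
      call MPI_Finalize(ierr)
    end program check_pe

I compile and run it with the same wrappers as the actual code (mpif90 check_pe.f90 -o check_pe, then mpirun -np 8 ./check_pe); if it prints "of 1" eight times instead of "of 8", the compiler wrapper and the launcher belong to different MPI installations.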

PS: With Intel MPI, this is the error I got (compiled with mpiifort and -qopenmp):

Abort(941211497) on node 16 (rank 16 in comm 0): Fatal error in PMPI_Isend: Unknown error class, error stack:
PMPI_Isend(152)...........: MPI_Isend(buf=0x2aba1cbc8060, count=4900, dtype=0x4c000829, dest=20, tag=0, MPI_COMM_WORLD, request=0x7ffec8586e5c) failed
MPID_Isend(662)...........:
MPID_isend_unsafe(282)....:
MPIDI_OFI_send_normal(305): failure occurred while allocating memory for a request object
Abort(203013993) on node 17 (rank 17 in comm 0): Fatal error in PMPI_Isend: Unknown error class, error stack:
PMPI_Isend(152)...........: MPI_Isend(buf=0x2b38c479c060, count=4900, dtype=0x4c000829, dest=21, tag=0, MPI_COMM_WORLD, request=0x7fffc20097dc) failed
MPID_Isend(662)...........:
MPID_isend_unsafe(282)....:
MPIDI_OFI_send_normal(305): failure occurred while allocating memory for a request object
[mpiexec@cx0321.obcx] HYD_sock_write (../../../../../src/pm/i_hydra/libhydra/sock/hydra_sock_intel.c:357): write error (Bad file descriptor)
[mpiexec@cx0321.obcx] cmd_bcast_root (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:164): error sending cmd 15 to proxy
[mpiexec@cx0321.obcx] send_abort_rank_downstream (../../../../../src/pm/i_hydra/mpiexec/intel/i_mpiexec.c:557): unable to send response downstream
[mpiexec@cx0321.obcx] control_cb (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1576): unable to send abort rank to downstreams
[mpiexec@cx0321.obcx] HYDI_dmx_poll_wait_for_event (../../../../../src/pm/i_hydra/libhydra/demux/hydra_demux_poll.c:79): callback returned error status
[mpiexec@cx0321.obcx] main (../../../../../src/pm/i_hydra/mpiexec/mpiexec.c:1962): error waiting for event
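
The part of the trace that stands out to me is "failure occurred while allocating memory for a request object". From the little MPI I know, this can happen when nonblocking calls such as MPI_Isend are posted every time step but their requests are never completed (MPI_Wait/MPI_Waitall) or freed, so they accumulate until the library can no longer allocate new ones; that would also fit the error only appearing after a set number of iterations. Here is a rough sketch of the halo-exchange pattern I plan to audit the code for; the subroutine name, buffers, counts and neighbour ranks are placeholders of mine, not taken from the actual code:

    ! exchange_halo.f90 -- my own sketch of the pattern I will look for;
    ! names and sizes are made up. The key point is that every request
    ! returned by MPI_Irecv/MPI_Isend is completed by MPI_Waitall before
    ! the next time step, so no request objects are leaked.
    subroutine exchange_halo(sendbuf, recvbuf, n, left, right)
      use mpi
      implicit none
      integer, intent(in) :: n, left, right
      real(8), intent(in)  :: sendbuf(n)
      real(8), intent(out) :: recvbuf(n)
      integer :: ierr
      integer :: requests(2)
      integer :: statuses(MPI_STATUS_SIZE, 2)

      call MPI_Irecv(recvbuf, n, MPI_DOUBLE_PRECISION, left,  0, &
                     MPI_COMM_WORLD, requests(1), ierr)
      call MPI_Isend(sendbuf, n, MPI_DOUBLE_PRECISION, right, 0, &
                     MPI_COMM_WORLD, requests(2), ierr)

      ! If this completion step is skipped, the requests returned by the
      ! nonblocking calls pile up on every time step.
      call MPI_Waitall(2, requests, statuses, ierr)
    end subroutine exchange_halo

If the actual code posts MPI_Isend in the time loop without a matching wait or MPI_Request_free, that would be the first thing I try to fix.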

I will be happy to provide the code in case somebody is willing to take a look at it. It is written in Fortran, with some of the functions written in C. My research has come to a complete halt because of this problem, and nobody at my lab has enough experience with MPI to resolve it.

  • can you please copy/paste the error messages? This will have search engines correctly index the question and hence make it (more) useful to other/future readers. – Gilles Gouaillardet Mar 13 '20 at 04:40
  • Please provide the version of Intel MPI, your command line, and which interconnect do you have? Also, is this a public code? Is there a way you can provide the source? You would be able to provide source privately if you post on the Intel forum: https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology – ChileAddict - Intel Mar 27 '20 at 22:08
  • @ChileAddict-Intel Thank you for your reply. Unfortunately it is not a public code, but if you'd be willing to look into it, I will be happy to share it with you privately. – Vedabit Saha Mar 30 '20 at 11:47
  • @ChileAddict-Intel as for Intel MPI version: 19.0.4.243 20190416 – Vedabit Saha Mar 30 '20 at 11:47
  • @ChileAddict-Intel I'm sorry I'm not sure what you mean by interconnect, but command line : Red Hat Enterprise Linux Server release 7.6 – Vedabit Saha Mar 30 '20 at 12:04
  • @VedabitSaha our expert tried to assist you but could not enter a comment due to lack of reputation so his "answer" got deleted by one of the moderators. Since you are stuck, I would post your question on our Intel Forum: https://software.intel.com/en-us/forums/intel-clusters-and-hpc-technology – ChileAddict - Intel Mar 30 '20 at 16:26
  • @VedabitSaha maybe you could try with a more current release of Intel MPI? It is at 19.06 now. Your Linux release looks to be supported so probably not an issue there. – ChileAddict - Intel Mar 30 '20 at 18:54

0 Answers