
I am trying to parallelize a small part of my Python code in Fortran 90. So, as a start, I am trying to understand how the spawning function works.

First, I tried to spawn a child process in Python from a Python parent, using the example for dynamic process management from the mpi4py tutorial. Everything worked fine. In this case, from what I understand, only the inter-communicator between the parent process and the child process is used.
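For reference, here is a condensed version of that tutorial example (parent and children cooperate to compute pi); all communication goes through the inter-communicator returned by Spawn(), and no Merge() is involved. The parent script:

from mpi4py import MPI
import numpy
import sys

# spawn 5 Python workers running the child script cpi.py
comm = MPI.COMM_SELF.Spawn(sys.executable, args=['cpi.py'], maxprocs=5)
# broadcast the number of intervals; root=MPI.ROOT marks the sending
# side of a collective over an inter-communicator
N = numpy.array(100, 'i')
comm.Bcast([N, MPI.INT], root=MPI.ROOT)
# gather the partial sums from the children
PI = numpy.array(0.0, 'd')
comm.Reduce(None, [PI, MPI.DOUBLE], op=MPI.SUM, root=MPI.ROOT)
print(PI)
comm.Disconnect()

and the child script (cpi.py):

from mpi4py import MPI
import numpy

comm = MPI.Comm.Get_parent()   # inter-communicator back to the parent
size = comm.Get_size()
rank = comm.Get_rank()

N = numpy.array(0, dtype='i')
comm.Bcast([N, MPI.INT], root=0)
# midpoint rule for the integral of 4/(1+x^2) over [0,1], split across workers
h = 1.0 / N
s = 0.0
for i in range(rank, N, size):
    x = h * (i + 0.5)
    s += 4.0 / (1.0 + x ** 2)
PI = numpy.array(s * h, dtype='d')
comm.Reduce([PI, MPI.DOUBLE], None, op=MPI.SUM, root=0)
comm.Disconnect()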

Then, I moved on to an example that spawns a Fortran 90 child process from a Python parent. For this, I used an example from a previous Stack Overflow post. The Python code (master.py) that spawns the Fortran child is as follows:

from mpi4py import MPI
import numpy

'''
slavef90 is an executable built starting from slave.f90
'''
# Spawning a process running an executable
# sub_comm is an MPI inter-communicator
sub_comm = MPI.COMM_SELF.Spawn('slavef90', args=[], maxprocs=1)
# common_comm is an intra-communicator across the Python process and the spawned process.
# All kinds of collective communication (Bcast...) are now possible between the Python process and the spawned process
common_comm=sub_comm.Merge(False)
print('parent in common_comm ', common_comm.Get_rank(), ' of  ', common_comm.Get_size())
data = numpy.arange(1, dtype='int32')
data[0]=42
print("Python sending message to fortran: {}".format(data))
common_comm.Send([data, MPI.INT], dest=1, tag=0)

print("Python over")
# disconnecting the shared communicators is required to finalize the spawned process.
sub_comm.Disconnect()
common_comm.Disconnect()

The corresponding Fortran 90 code (slave.f90), which the spawned child runs, is as follows:

  program test
  !
  implicit none
  !
  include 'mpif.h'
  !
  integer :: ierr,s(1),stat(MPI_STATUS_SIZE)
  integer :: parentcomm,intracomm
  !
  call MPI_INIT(ierr)
  call MPI_COMM_GET_PARENT(parentcomm, ierr)
  call MPI_INTERCOMM_MERGE(parentcomm, 1, intracomm, ierr)
  call MPI_RECV(s, 1, MPI_INTEGER, 0, 0, intracomm,stat, ierr)
  print*, 'fortran program received: ', s
  call MPI_COMM_DISCONNECT(intracomm, ierr)
  call MPI_COMM_DISCONNECT(parentcomm, ierr)
  call MPI_FINALIZE(ierr)
  endprogram test

I compiled the Fortran 90 code with mpif90 slave.f90 -o slavef90 -Wall and ran the Python code normally with python master.py. I get the desired output, but the spawned process won't disconnect: nothing after the Disconnect calls (call MPI_COMM_DISCONNECT(intracomm, ierr) and call MPI_COMM_DISCONNECT(parentcomm, ierr)) is executed in the Fortran code, consequently nothing after the Disconnect calls in the Python code is executed either, and the run never terminates in the terminal.

In this case, to my understanding, the inter-communicator and the intra-communicator are merged so that the child and parent processes are no longer two different groups, and there seems to be some problem when disconnecting them. But I am not able to figure out a solution. I also rewrote the child code in C++ and in Python and faced the same problem. Any help is appreciated. Thanks.

  • Please use more general language tags [tag:fortran] [tag:python]. You can always add a version tag if necessary. See the tag description when adding one. It is recommended to use the MPI module `use mpi` instead of including the `mpif.h` file. Then the compiler can check many things and warn you or stop you from doing wrong stuff. – Vladimir F Героям слава Apr 25 '20 at 12:02
  • What if you `MPI_Comm_free()` the intra (aka merged) communicator and `MPI_Comm_disconnect()` only the inter-communicator? – Gilles Gouaillardet Apr 25 '20 at 12:27
  • @GillesGouaillardet Yes, I tried freeing the intra communicators instead of disconnecting them. Unfortunately, the problem still exists. – AdhityaRavi Apr 25 '20 at 12:34
  • Did you try writing the spawner in C? (to make sure the error does not come from the MPI library) – Gilles Gouaillardet Apr 25 '20 at 14:48
  • Also, can you try running `mpirun -np 1 python master.py` – Gilles Gouaillardet Apr 25 '20 at 14:57
  • @GillesGouaillardet Thank you for the suggestions. I tried the `mpirun` suggestion; it didn't work, unfortunately. I still have to write the spawner in C. I also found an [open question](https://stackoverflow.com/questions/42446934/mpi4py-freezes-when-calling-merge-and-disconnect?noredirect=1&lq=1) on the same topic, but it didn't help me much. Fortunately, (so far) I am able to parallelize my code without having to use `Merge()`. Everything works smoothly as long as the parent and the child processes are kept as separate groups (see the sketch after these comments). – AdhityaRavi Apr 26 '20 at 18:08
  • Out of curiosity, on top of which MPI library (vendor and version) is your `mpi4py` built on? – Gilles Gouaillardet Apr 27 '20 at 03:57
  • The MPI library that is being used is Open MPI 2.1.1 – AdhityaRavi Apr 27 '20 at 10:01
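For what it's worth, a minimal sketch of the Merge()-free pattern mentioned in the comments (my reconstruction, not code from the thread): the parent addresses the child directly through the inter-communicator returned by Spawn(), so there is no merged communicator to tear down. The Fortran child would correspondingly call MPI_RECV(s, 1, MPI_INTEGER, 0, 0, parentcomm, stat, ierr) on the communicator returned by MPI_COMM_GET_PARENT.

from mpi4py import MPI
import numpy

# spawn the Fortran child as before, but never call Merge()
sub_comm = MPI.COMM_SELF.Spawn('slavef90', args=[], maxprocs=1)
data = numpy.array([42], dtype='int32')
# dest=0 is rank 0 of the remote (child) group of the inter-communicator
sub_comm.Send([data, MPI.INT], dest=0, tag=0)
# only the inter-communicator has to be disconnected
sub_comm.Disconnect()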

1 Answer


Note that your Python script first disconnects the inter-communicator and then the intra-communicator, but your Fortran program first disconnects the intra-communicator and then the inter-communicator.

I am able to run this test on a Mac (Open MPI and mpi4py installed via brew) after fixing the order and freeing the intra-communicator.

Here is my master.py

#!/usr/local/Cellar/python@3.8/3.8.2/bin/python3

from mpi4py import MPI
import numpy

'''
slavef90 is an executable built starting from slave.f90
'''
# Spawning a process running an executable
# sub_comm is an MPI inter-communicator
sub_comm = MPI.COMM_SELF.Spawn('slavef90', args=[], maxprocs=1)
# common_comm is an intra-communicator across the Python process and the spawned process.
# All kinds of collective communication (Bcast...) are now possible between the Python process and the spawned process
common_comm=sub_comm.Merge(False)
print('parent in common_comm ', common_comm.Get_rank(), ' of  ', common_comm.Get_size())
data = numpy.arange(1, dtype='int32')
data[0]=42
print("Python sending message to fortran: {}".format(data))
common_comm.Send([data, MPI.INT], dest=1, tag=0)

print("Python over")
# free the (merged) intra-communicator
common_comm.Free()
# disconnecting the inter-communicator is required to finalize the spawned process.
sub_comm.Disconnect()

and my slave.f90

  program test
  !
  implicit none
  !
  include 'mpif.h'
  !
  integer :: ierr,s(1),stat(MPI_STATUS_SIZE)
  integer :: parentcomm,intracomm
  integer :: rank, size
  !
  call MPI_INIT(ierr)
  call MPI_COMM_GET_PARENT(parentcomm, ierr)
  call MPI_INTERCOMM_MERGE(parentcomm, .true., intracomm, ierr)
  call MPI_COMM_RANK(intracomm, rank, ierr)
  call MPI_COMM_SIZE(intracomm, size, ierr)
  call MPI_RECV(s, 1, MPI_INTEGER, 0, 0, intracomm,stat, ierr)
  print*, 'fortran program', rank, ' / ', size, ' received: ', s
  print*, 'Slave frees intracomm'
  call MPI_COMM_FREE(intracomm, ierr)
  print*, 'Slave disconnect intercomm'
  call MPI_COMM_DISCONNECT(parentcomm, ierr)
  print*, 'Slave finalize'
  call MPI_FINALIZE(ierr)
  endprogram test
Gilles Gouaillardet
  • Thank you very much! This makes life a lot easier for me now. I can't believe that I made the stupid mistake of swapping the `Disconnect()` commands. Anyway, thank you once again! – AdhityaRavi Apr 28 '20 at 08:49
  • For closure, could you tell me why the code freezes if we disconnect the intracomm instead of freeing it? I understood (which could be incorrect) from the [MPI reference guide](https://books.google.de/books?id=uK3nr41r8zMC&pg=PA90&lpg=PA90&dq=difference+between+freeing+and+disconnecting+a+mpi+communicator&source=bl&ots=9pt1hStEkx&sig=ACfU3U1oP0f9xcvmfwf9qN5Dxq64esNHDQ&hl=en&sa=X&ved=2ahUKEwj1h9vA3IrpAhVBMewKHTduDpYQ6AEwAHoECAgQAQ#v=onepage&q=difference%20between%20freeing%20and%20disconnecting%20a%20mpi%20communicator&f=false) that `Disconnect()` is preferable over `Free()`. – AdhityaRavi Apr 28 '20 at 08:49
  • I do not know for sure, but I've heard `MPI_Comm_disconnect()` might not be correctly implemented in Open MPI, and that could explain the freeze. – Gilles Gouaillardet Apr 28 '20 at 09:06
  • It works for me. Did you update the master so it sends a message to all the children? If not, children other than the first one will be stuck in `MPI_Recv()` (a sketch follows these comments). Also, note Open MPI 2.1.1 is no longer supported, and you should consider upgrading to a more recent release such as 4.0.3 – Gilles Gouaillardet Apr 28 '20 at 09:19
  • Yes, got it! You're right; I am upgrading my Open MPI as I am commenting. Thank you for the support! – AdhityaRavi Apr 28 '20 at 09:23
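To illustrate the multiple-children point in the last exchange, here is a sketch of the parent side (my illustration, assuming maxprocs=3 and the merged communicator from the answer above, not code from the thread): every child posts an MPI_RECV, so the parent must send one message to each of ranks 1..size-1 of the merged communicator.

from mpi4py import MPI
import numpy

sub_comm = MPI.COMM_SELF.Spawn('slavef90', args=[], maxprocs=3)
common_comm = sub_comm.Merge(False)
data = numpy.array([42], dtype='int32')
# the parent is rank 0 in the merged communicator; the children are ranks 1..size-1
for child in range(1, common_comm.Get_size()):
    common_comm.Send([data, MPI.INT], dest=child, tag=0)
common_comm.Free()
sub_comm.Disconnect()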