
I am testing MPI I/O.

  subroutine save_vtk
    integer :: filetype, fh, unit
    integer(MPI_OFFSET_KIND) :: pos
    real(RP),allocatable :: buffer(:,:,:)
    integer :: ie

    if (master) then
      open(newunit=unit,file="out.vtk", &
           access='stream',status='replace',form="unformatted",action="write")
      ! write the header
      close(unit)
    end if

    call MPI_Barrier(mpi_comm,ie)

    call MPI_File_open(mpi_comm,"out.vtk", MPI_MODE_APPEND + MPI_MODE_WRONLY, MPI_INFO_NULL, fh, ie)

    call MPI_Type_create_subarray(3, int(ng), int(nxyz), int(off), &
       MPI_ORDER_FORTRAN, MPI_RP, filetype, ie)

    call MPI_type_commit(filetype, ie)

    call MPI_Barrier(mpi_comm,ie)
    call MPI_File_get_position(fh, pos, ie)
    call MPI_Barrier(mpi_comm,ie)

    call MPI_File_set_view(fh, pos, MPI_RP, filetype, "native", MPI_INFO_NULL, ie)

    buffer = BigEnd(Phi(1:nx,1:ny,1:nz))

    call MPI_File_write_all(fh, buffer, nx*ny*nz, MPI_RP, MPI_STATUS_IGNORE, ie)

    call MPI_File_close(fh, ie)

  end subroutine

The variables not declared here come from host association, and some error checking has been removed for brevity. I receive this error when running it on a national academic cluster:

*** An error occurred in MPI_Isend
*** reported by process [3941400577,18036219417246826496]
*** on communicator MPI COMMUNICATOR 20 DUP FROM 0
*** MPI_ERR_BUFFER: invalid buffer pointer
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)

The error is triggered by the call to MPI_File_write_all. I suspect it may be connected with the size of the buffer, which is the full nx*ny*nz and is on the order of 10^5 to 10^6, but I cannot exclude a programming error on my side, as I have no prior experience with MPI I/O.

The MPI implementation used is OpenMPI 1.8.0 with Intel Fortran 14.0.2.

Do you know how to make it work and write the file?

--- Edit2 ---

I am testing a simplified version; the important code remains the same, and the full source is here. Notice that it works with gfortran and fails with different MPIs with Intel. I wasn't able to compile it with PGI. I was also wrong in saying that it fails only on different nodes; it fails even in a single-process run.

>module ad gcc-4.8.1
>module ad openmpi-1.8.0-gcc
>mpif90 save.f90
>./a.out 
 Trying to decompose in           1           1           1 process grid.
>mpirun a.out
 Trying to decompose in           1           1           2 process grid.

>module rm openmpi-1.8.0-gcc
>module ad openmpi-1.8.0-intel
>mpif90 save.f90
>./a.out 
 Trying to decompose in           1           1           1 process grid.
 ERROR write_all
 MPI_ERR_IO: input/output error                                                 



>module rm openmpi-1.8.0-intel
>module ad openmpi-1.6-intel
>mpif90 save.f90
>./a.out 
 Trying to decompose in           1           1           1 process grid.
 ERROR write_all
 MPI_ERR_IO: input/output error                                                 



[luna24.fzu.cz:24260] *** An error occurred in MPI_File_set_errhandler
[luna24.fzu.cz:24260] *** on a NULL communicator
[luna24.fzu.cz:24260] *** Unknown error
[luna24.fzu.cz:24260] *** MPI_ERRORS_ARE_FATAL: your MPI job will now abort
--------------------------------------------------------------------------
An MPI process is aborting at a time when it cannot guarantee that all
of its peer processes in the job will be killed properly.  You should
double check that everything has shut down cleanly.

  Reason:     After MPI_FINALIZE was invoked
  Local host: luna24.fzu.cz
  PID:        24260
--------------------------------------------------------------------------
>module rm openmpi-1.6-intel
>module ad mpich2-intel
>mpif90 save.f90
>./a.out 
 Trying to decompose in           1           1           1 process grid.
 ERROR write_all
 Other I/O error , error stack:
ADIOI_NFS_WRITECONTIG(70): Other I/O error Bad address
  • How much is `nx*ny*nz`? – Hristo Iliev May 12 '14 at 21:19
  • It is the proportional part of the global size `ng = [131, 123, 127]`, which is divided between the processes. I tested with `2-16` processes, so it is on the order of `1E5` to `1E6`. – Vladimir F Героям слава May 12 '14 at 21:34
  • Could you try with an earlier version of Open MPI? Could you also present a compilable MWE that reproduces the problem? – Hristo Iliev May 12 '14 at 22:12
  • If I remove the computational part before it, it should be possible to compile it also with the older compiler used for other OpenMPI builds on this cluster. I will try. – Vladimir F Героям слава May 13 '14 at 07:30
  • Always useful to compile with all warnings/checking on; with Intel, that's `-warn all -check all`, with gfortran, `-Wall -fcheck=all`. You can't see it in the question, but in your posted code, `buffer` is never allocated. `allocate(buffer(nx,ny,nz))` before the call to `BigEnd` and `deallocate(buffer)` after the write fixes the problem. – Jonathan Dursi May 13 '14 at 12:37
  • @JonathanDursi Thanks! I actually did use `-fcheck=all`, but with `gfortran`. I often forget one has to add `-standard-semantics` to `ifort` to enable automatic reallocation. It is a really annoying feature of `ifort`. – Vladimir F Героям слава May 13 '14 at 13:02
  • `-check` indeed gives `forrtl: severe (408): fort: (8): Attempt to fetch from allocatable variable BUFFER when it is not allocated`, and `-standard-semantics` resolves it. When debugging on my desktop I quickly put `-fcheck=all -Wall` into the build script, but obviously `ifort` ignores it. – Vladimir F Героям слава May 13 '14 at 13:08
  • @VladimirF Why not put that into an answer? I didn't know about `-standard-semantics`, but I stumbled over automatic reallocation with `ifort` before (and found the bug by using an old version of `gfortran`). Your comment would have been helpful to me then ;-) – Alexander Vogt May 13 '14 at 16:31
  • I don't know if anybody is going to find it here, but I can try. Also, I did not identify the problem myself. – Vladimir F Героям слава May 13 '14 at 18:43

1 Answer


In the line

 buffer = BigEnd(Phi(1:nx,1:ny,1:nz))

the array buffer should be allocated automatically to the shape of the right-hand side according to the Fortran 2003 standard (this is not allowed in Fortran 95). Intel Fortran as of version 14 does not do this by default; it requires the option

-assume realloc_lhs

to do that. This option is also included (among other options) in

-standard-semantics

Because this option was not in effect when the code in the question was tested, the program accessed an unallocated array, and the resulting undefined behavior led to the crash.
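
An alternative, suggested by Jonathan Dursi in the comments, is to allocate the buffer explicitly, so that the code does not rely on automatic reallocation at all and works regardless of compiler flags. A minimal sketch of that change to the write part of the subroutine in the question, assuming the same host-associated variables as above:

    ! allocate explicitly instead of relying on F2003 automatic
    ! (re)allocation on assignment, which ifort 14 does not enable by default
    allocate(buffer(nx, ny, nz))
    buffer = BigEnd(Phi(1:nx, 1:ny, 1:nz))

    call MPI_File_write_all(fh, buffer, nx*ny*nz, MPI_RP, MPI_STATUS_IGNORE, ie)

    deallocate(buffer)

With either the explicit allocation or `-assume realloc_lhs`/`-standard-semantics`, MPI_File_write_all receives a valid buffer and the file is written.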