
As part of my coursework I have to write an MPI program (in Fortran) that solves a PDE from continuum mechanics.

In the sequential program, the output file is written as follows:

do i=1,XX
    do j=1,YY
        do k=1,ZZ
            write(ifile) R(i,j,k)
            write(ifile) U(i,j,k)
            write(ifile) V(i,j,k)
            write(ifile) W(i,j,k)
            write(ifile) P(i,j,k)
        end do
    end do
end do

In the parallel program I write the same data as follows (the domain is decomposed only along the X axis):

call MPI_TYPE_CREATE_SUBARRAY(4, [INT(5), INT(ZZ), INT(YY), INT(XX)], &
                              [5, ZZ, YY, PDB(iam).Xelements],        &
                              [0, 0, 0, PDB(iam).Xoffset],            &
                              MPI_ORDER_FORTRAN, MPI_FLOAT, slice, ierr)
call MPI_TYPE_COMMIT(slice, ierr)

call MPI_FILE_OPEN(MPI_COMM_WORLD, cFileName, IOR(MPI_MODE_CREATE, MPI_MODE_WRONLY), MPI_INFO_NULL, ifile, ierr)

do i = 1,PDB(iam).Xelements
    do j = 1,YY
        do k = 1,ZZ
            dataTmp(1,k,j,i) = R(i,j,k)
            dataTmp(2,k,j,i) = U(i,j,k)
            dataTmp(3,k,j,i) = V(i,j,k)
            dataTmp(4,k,j,i) = W(i,j,k)
            dataTmp(5,k,j,i) = P(i,j,k)
        end do
    end do
end do

call MPI_FILE_SET_VIEW(ifile, offset, MPI_FLOAT, slice, 'native', MPI_INFO_NULL, ierr)
call MPI_FILE_WRITE_ALL(ifile, dataTmp, 5*PDB(iam).Xelements*YY*ZZ, MPI_FLOAT, wstatus, ierr)
call MPI_BARRIER(MPI_COMM_WORLD, ierr)

It works well, but I am not sure about using the dataTmp array. Which solution would be faster and more correct: using a 4-D array like dataTmp throughout the whole program, or creating five separate MPI types with different displacements?

2 Answers


Using dataTmp is fine, if you have the memory space. Your MPI_FILE_WRITE_ALL call will be the most expensive part of this code.

You've done the hard part, setting up the MPI-IO file view. If you want to get rid of dataTmp, you could create an MPI datatype to describe the arrays (probably using MPI_Type_hindexed and MPI_Get_address), then use MPI_BOTTOM as the memory buffer.
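
A rough sketch of what that could look like (illustrative only, not from the original code): it assumes the local arrays are dimensioned (nx, YY, ZZ) with nx = PDB(iam).Xelements, and the subroutine name build_mem_type and its arguments are invented for the example. MPI_Type_create_hindexed is the current name for the deprecated MPI_Type_hindexed, and MPI_REAL matches a default Fortran real. The idea is to record the absolute address of every value in the order the file view expects:

! Sketch only: describe R,U,V,W,P in memory with absolute displacements so
! MPI_FILE_WRITE_ALL can write them directly, with MPI_BOTTOM as the buffer.
! Assumes the actual arrays are contiguous (no copy-in/copy-out on the call).
subroutine build_mem_type(R, U, V, W, P, nx, ny, nz, memtype)
    use mpi
    implicit none
    integer, intent(in)  :: nx, ny, nz
    real,    intent(in)  :: R(nx,ny,nz), U(nx,ny,nz), V(nx,ny,nz), &
                            W(nx,ny,nz), P(nx,ny,nz)
    integer, intent(out) :: memtype

    integer(kind=MPI_ADDRESS_KIND), allocatable :: disp(:)
    integer, allocatable :: blens(:)
    integer :: i, j, k, n, ierr

    allocate(disp(5*nx*ny*nz), blens(5*nx*ny*nz))
    blens = 1                          ! one REAL per displacement

    ! Same traversal order as the subarray file view:
    ! variable index fastest, then k, then j, then i.
    n = 0
    do i = 1, nx
        do j = 1, ny
            do k = 1, nz
                call MPI_GET_ADDRESS(R(i,j,k), disp(n+1), ierr)
                call MPI_GET_ADDRESS(U(i,j,k), disp(n+2), ierr)
                call MPI_GET_ADDRESS(V(i,j,k), disp(n+3), ierr)
                call MPI_GET_ADDRESS(W(i,j,k), disp(n+4), ierr)
                call MPI_GET_ADDRESS(P(i,j,k), disp(n+5), ierr)
                n = n + 5
            end do
        end do
    end do

    call MPI_TYPE_CREATE_HINDEXED(n, blens, disp, MPI_REAL, memtype, ierr)
    call MPI_TYPE_COMMIT(memtype, ierr)
    deallocate(disp, blens)
end subroutine build_mem_type

The write then becomes call MPI_FILE_WRITE_ALL(ifile, MPI_BOTTOM, 1, memtype, wstatus, ierr). One displacement per scalar is simple but not free (the displacement array costs 8 bytes per value), so whether this actually beats dataTmp is something to measure.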

Rob Latham
  • Thank you for the answer! Is the approach that gets rid of dataTmp faster than (or as fast as) the one that uses this array? If it is slower, it does not suit me. If you don't mind, please also look at my other question about MPI-IO of a little-endian file on big-endian systems. – Vasiliy Oct 15 '14 at 15:02

If I/O speed is an issue and you have the option, I'd suggest changing the file format - or, alternately, how the data is laid out in memory - so that the two line up more closely: in the serial code, writing the data in this transposed and interleaved way is going to be very slow:

program testoutput
implicit none

integer, parameter :: XX=512, YY=512, ZZ=512
real, dimension(:,:,:), allocatable :: R, U, V, W, P
integer :: timer
integer :: ifile
real :: elapsed
integer :: i,j,k

allocate(R(XX,YY,ZZ), P(XX,YY,ZZ))
allocate(U(XX,YY,ZZ), V(XX,YY,ZZ), W(XX,YY,ZZ))

R = 1.; U = 2.; V = 3.; W = 4.; P = 5.

open(newunit=ifile, file='interleaved.dat', form='unformatted', status='new')
call tick(timer)
do i=1,XX
    do j=1,YY
        do k=1,ZZ
            write(ifile) R(i,j,k)
            write(ifile) U(i,j,k)
            write(ifile) V(i,j,k)
            write(ifile) W(i,j,k)
            write(ifile) P(i,j,k)
        end do
    end do
end do
elapsed=tock(timer)
close(ifile)

print *,'Elapsed time for interleaved: ', elapsed

open(newunit=ifile, file='noninterleaved.dat', form='unformatted',status='new')
call tick(timer)
write(ifile) R
write(ifile) U
write(ifile) V
write(ifile) W
write(ifile) P
elapsed=tock(timer)
close(ifile)

print *,'Elapsed time for noninterleaved: ', elapsed

deallocate(R,U,V,W,P)
contains

subroutine tick(t)
    integer, intent(OUT) :: t

    call system_clock(t)
end subroutine tick

! returns time in seconds from now to time described by t 
real function tock(t)
    integer, intent(in) :: t
    integer :: now, clock_rate

    call system_clock(now,clock_rate)

    tock = real(now - t)/real(clock_rate)
end function tock

end program testoutput

Running gives

$ gfortran -Wall io-serial.f90 -o io-serial
$ ./io-serial 
 Elapsed time for interleaved:    225.755005    
 Elapsed time for noninterleaved:    4.01700020 

As Rob Latham, who knows more than a few things about this stuff, says, your transposition for the parallel version is fine - it does the interleaving and transposing explicitly in memory, where it's much faster, and then blasts it out to disk. It's about as fast as the IO is going to get.

You can definitely avoid the dataTmp array by having one or five individual datatypes do the transposition/interleaving for you on the way out to disk via the MPI_File_write_all routine. That will give you a bit more balance between memory usage and performance. You won't be explicitly defining a big temporary array, but the MPI-IO code will still improve performance over looping over individual elements by doing a fair bit of buffering, meaning that a certain amount of memory is set aside to do the writing efficiently. The good news is that the balance is tunable by setting MPI-IO hints in the Info variable; the bad news is that the code is likely to be less clear than what you have now.
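
For the hints part, a small sketch of what that can look like, assuming your MPI library honours these keys (cb_buffer_size is a reserved hint in the MPI standard; romio_cb_write is ROMIO-specific, and the useful values depend on your implementation and filesystem):

! Sketch only: pass an Info object with I/O hints instead of MPI_INFO_NULL.
integer :: info

call MPI_INFO_CREATE(info, ierr)
call MPI_INFO_SET(info, 'cb_buffer_size', '16777216', ierr)   ! 16 MB collective buffer
call MPI_INFO_SET(info, 'romio_cb_write', 'enable', ierr)     ! collective buffering on writes
call MPI_FILE_OPEN(MPI_COMM_WORLD, cFileName, IOR(MPI_MODE_CREATE, MPI_MODE_WRONLY), &
                   info, ifile, ierr)
call MPI_INFO_FREE(info, ierr)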

Jonathan Dursi
  • Thank you for your answer. But, unfortunately, I cannot change the file format. I clearly understand that this format is inconvenient, but I cannot change it. I am a newbie in MPI. Could you possibly show a small example of what you describe (individual datatypes, etc.)? – Vasiliy Oct 15 '14 at 15:25
  • When I posted my answer, I was wondering why @Jonathan Dursi had not yet provided an answer. Clearly he was spending a bit of effort to ensure he retained the title of King of StackOverflow. – Rob Latham Oct 15 '14 at 19:59
  • @Vasiliy, perhaps you can consider pre-processing and post-processing to get the file into the format you want for performance? Often a conversion step can happen quickly, though for large enough datasets, of course, there is a point where serial pre- or post-conversion is untenable. – Rob Latham Oct 15 '14 at 20:04