
I am running exactly the same code on two different clusters. One uses MPI with Intel Fortran and the other uses Cray Fortran. The former is an old cluster and the latter is the newest one we have at school. The code works perfectly on the old cluster (MPI + Intel Fortran) but fails on the Cray Fortran cluster. The part of the output subroutine that triggers the error is this:

    Subroutine Output
      use Variables
      implicit none

      ! Formatting section
    398 format(6(e22.15,2x))
    399 format(7(e22.15,2x))
    39  format(5(e22.15,2x))

      !!! Computing Cp for postprocessing purposes
      Cp = gamma*R_gas/(gamma-1)

      ! Creating the global mesh
      If (MyRank == 0) Call GridGlobal
      If (MyRank == 0) then
        open(330, file = 'Primitive_Variables.dat')
        write(330,*) 'TITLE = "Primitive Variables Contours"'
        write(330,*) 'VARIABLES = "X"'
        write(330,*) '"Y"'
        write(330,*) '"U Velocity"'
        write(330,*) '"V Velocity"'
        write(330,*) '"Density"'
        write(330,*) '"Temperature"'

        write(330,*) '  zone T = "zone1", I = ', ImaxGlobal, ' J= ', JmaxGlobal, ' F = point'

        do j = 1, JmaxGlobal
          do i = 1, ImaxGlobal
            write(330,398) xGlobal(i,j), yGlobal(i,j), u_oldGlobal(i,j), &
                           v_oldGlobal(i,j), r_oldGlobal(i,j), T_oldGlobal(i,j)
          enddo
        enddo
        close(330)
      End If
      ! ... (rest of Output not shown)
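The runtime reports that the CLOSE of unit 330 failed with "unit is not connected", which does not say which earlier I/O statement went wrong. A minimal diagnostic sketch, assuming a Fortran 2008 compiler: the same Tecplot write loop, but with `newunit=` instead of the hard-coded 330 and `iostat=`/`iomsg=` on every OPEN, WRITE and CLOSE, so the failing statement and the runtime's message are printed instead of the program aborting. The module name `checked_output`, the routine `write_primitive_variables` and its argument list are hypothetical; this is not the poster's code and not a confirmed fix.

```fortran
! Sketch only: module/routine names and arguments are hypothetical.
! The data loop mirrors format 398 in the question's Output routine.
module checked_output
  use, intrinsic :: iso_fortran_env, only: real64, error_unit
  implicit none
  private
  public :: write_primitive_variables
contains
  subroutine write_primitive_variables(fname, x, y, u, v, r, t, ios)
    character(len=*), intent(in) :: fname
    real(real64), intent(in) :: x(:,:), y(:,:), u(:,:), v(:,:), r(:,:), t(:,:)
    integer, intent(out) :: ios
    integer :: unit, i, j, ios_close
    character(len=256) :: msg

    ! newunit= (Fortran 2008) lets the runtime pick a free unit number.
    open(newunit=unit, file=fname, status='replace', action='write', &
         iostat=ios, iomsg=msg)
    if (ios /= 0) then
      write(error_unit, '(2a)') 'OPEN failed: ', trim(msg)
      return
    end if

    ! Tecplot header: three records from one statement ('/' starts a new record).
    write(unit, '(a/a/a,i0,a,i0,a)', iostat=ios, iomsg=msg) &
         'TITLE = "Primitive Variables Contours"', &
         'VARIABLES = "X", "Y", "U Velocity", "V Velocity", "Density", "Temperature"', &
         'ZONE T = "zone1", I = ', size(x, 1), ', J = ', size(x, 2), ', F = POINT'
    if (ios /= 0) write(error_unit, '(2a)') 'Header WRITE failed: ', trim(msg)

    ! One record per grid point; stop at the first failing WRITE and report where.
    rows: do j = 1, size(x, 2)
      if (ios /= 0) exit rows
      do i = 1, size(x, 1)
        write(unit, '(6(e22.15,2x))', iostat=ios, iomsg=msg) &
             x(i,j), y(i,j), u(i,j), v(i,j), r(i,j), t(i,j)
        if (ios /= 0) then
          write(error_unit, '(a,i0,a,i0,2a)') 'WRITE failed at i = ', i, &
               ', j = ', j, ': ', trim(msg)
          exit rows
        end if
      end do
    end do rows

    ! Always attempt the CLOSE; keep the first error code if there was one.
    close(unit, iostat=ios_close, iomsg=msg)
    if (ios_close /= 0) then
      write(error_unit, '(2a)') 'CLOSE failed: ', trim(msg)
      if (ios == 0) ios = ios_close
    end if
  end subroutine write_primitive_variables
end module checked_output
```

Called from rank 0 with the global arrays, a nonzero `ios` plus the printed `iomsg` text would show whether the OPEN, a particular data record, or the CLOSE is the first statement to fail on the Cray system.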

When I run the code, I get the following error:

    Application 135822 exit codes: 134
    Application 135822 exit signals: Killed
    Application 135822 resources: utime ~185s, stime ~1s, Rss ~1444696, inblocks ~1023410, outblocks ~5529676

    pwd

    setenv OMP_NUM_THREADS 1

    if ( -e IDS ) then
    aprun -j 1 -n 32 ./IDS

    sys-38 : UNRECOVERABLE error on system request
      Function not implemented

    Encountered during a CLOSE of unit 330
    Fortran unit 330 is not connected
    _pmiu_daemon(SIGCHLD): [NID 00018] [c0-0c0s4n2] [Wed Aug  2 16:57:32 2017] PE RANK 0 exit signal Aborted
    [NID 00018] 2017-08-02 17:53:44 Apid 135820: initiated application termination
    else

    exit

Because of this, the output subroutine stops writing the results and the whole computation is wasted.

Thanks in advance

For the record: this only happens for large arrays, i.e. larger than 2001x2001. I know that is not big at all, but for smaller arrays the error does not appear. The subroutine allocates the arrays required for printing, creates the file and starts writing the solution, but after a few elements the process stops and the error appears, so the file is never completed. I have tried running with different numbers of PEs and the problem always occurs.

The variables are declared in the following way:

    integer, parameter :: dp = 8
    real(kind=dp), dimension(:,:), allocatable :: r_old, u_old, v_old, T_old, a_old
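A side note, with no indication that it is related to the crash: `dp = 8` relies on the compiler's kind numbering, which the Fortran standard does not fix. Intel Fortran and gfortran both use 8 for double precision, so the declaration most likely behaves the same on both clusters. A portable sketch, assuming a Fortran 2008 compiler; the module name `precision_kinds` and the name `dp_alt` are hypothetical:

```fortran
! Hypothetical module: portable ways to request double precision
! instead of hard-coding the compiler-specific kind number 8.
module precision_kinds
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  ! Kind of a 64-bit real, as defined by the intrinsic module.
  integer, parameter :: dp = real64
  ! Alternative: ask for at least 15 significant digits and exponent range 307.
  integer, parameter :: dp_alt = selected_real_kind(p=15, r=307)
end module precision_kinds
```

With this module, the existing declaration `real(kind=dp), dimension(:,:), allocatable :: r_old, ...` would compile unchanged after `use precision_kinds`.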

The main program looks roughly like this:

    DO kk = 1, 2001
      ! This section calls different subroutines
      ! They are not relevant for the discussion
      Call MPI_BARRIER(MPI_COMM_WORLD, ierr) ! Barrier in MPI
    Enddo

    ! Postprocessing tasks and restart file
    Call MPI_BARRIER(MPI_COMM_WORLD, ierr) ! Barrier in MPI
    Call KillArrays   ! Deallocating the arrays not needed for writing output
    Call Write_SolutionRestart
    Call Output
    Call MPI_BARRIER(MPI_COMM_WORLD, ierr)
    Call MPI_FINALIZE(ierr)
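In the comments the poster reports that moving the OPEN and CLOSE into the main program made the problem go away, without a confirmed explanation of why. A minimal sketch of that structure, under stated assumptions: the caller (in the question, the main program on rank 0) owns the unit, and the output routine only writes to a unit it is handed, first using INQUIRE to confirm the unit is actually connected. The module `output_to_unit_mod`, the routine `write_field` and the reduction to two arrays are hypothetical simplifications, not the poster's code.

```fortran
! Hypothetical sketch: the caller opens/closes the file; this routine
! only writes, and refuses to write to a unit that is not connected.
module output_to_unit_mod
  use, intrinsic :: iso_fortran_env, only: real64, error_unit
  implicit none
contains
  subroutine write_field(unit, x, y, ios)
    integer, intent(in) :: unit
    real(real64), intent(in) :: x(:,:), y(:,:)
    integer, intent(out) :: ios
    logical :: is_open
    integer :: i, j

    ios = 0
    ! INQUIRE reports whether the unit is connected to a file right now.
    inquire(unit=unit, opened=is_open)
    if (.not. is_open) then
      write(error_unit, '(a,i0,a)') 'unit ', unit, ' is not connected'
      ios = 1
      return
    end if

    do j = 1, size(x, 2)
      do i = 1, size(x, 1)
        write(unit, '(2(e22.15,2x))', iostat=ios) x(i,j), y(i,j)
        if (ios /= 0) return
      end do
    end do
  end subroutine write_field
end module output_to_unit_mod
```

In the question's main program this would correspond to rank 0 doing the `open` before `Call Output`, passing the unit number in, and doing the matching `close` afterwards.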

The subroutine that is giving me problem is "Output".

According to the error file, the failing process is always rank 0.

  • This is obviously a runtime error, not a compile-time error as you tagged it, and fortran would be a good tag. By 'stops printing' do you mean it doesn't print at all, or it prints an initial part of the data? What is the format in 398 and in particular how much output space does it take? – dave_thompson_085 Aug 03 '17 at 02:00
  • @dave_thompson_085 I will change the tags, thanks. It does print part of the file and then stops. What is even more difficult to understand is that the subroutine allocates the arrays for printing, starts writing the file, and then the error suddenly appears: it quits writing and leaves the subroutine. – Julio Mendez Aug 03 '17 at 14:46
  • What does it print? When run multiple times does it always print the same amount to unit 330? How are your variables declared (please show at the very least the whole subprogram)? What is format 398? What happens if you put an MPI_Barrier over the appropriate communicator after the endif in the above? i.e. are you sure it is rank 0 that is causing the abort? – Ian Bush Aug 04 '17 at 07:49
  • @IanBush thanks for your suggestion, please find the information related to your inquiry above in the question. Thanks for your input.!! – Julio Mendez Aug 06 '17 at 17:06
  • @IanBush I tried placing the MPI_BARRIER before and after the if statement and it did not help. – Julio Mendez Aug 06 '17 at 19:40
  • Please show the **complete** subprogram as IanBush already suggested. Better read how to make a [mcve]. I suggest reading [ask] too. – Vladimir F Героям слава Aug 07 '17 at 09:05
  • Hi @VladimirF, the complete subprogram has more than 1000 lines and they do not add much information; it is a bunch of do loops computing stresses and so forth. Here is the kicker: the output subroutine works perfectly on the other cluster with OpenMP, but not with MPICH on the new cluster for sizes greater than 4001x4001. – Julio Mendez Aug 07 '17 at 13:51
  • If it has 1000 lines, please see [mcve]. Otherwise we can't really help you (as you can see by having no answers). – Vladimir F Героям слава Aug 07 '17 at 13:55
  • @VladimirF due to copyright restrictions and sensitive information I cannot show many details of the code. The information shown is enough to find the issue. In fact, the problem is already solved: it was because Cray Fortran is very strict about the Fortran specification. I fixed it by putting the open and close statements in the main program instead of in the subroutines. Thanks to everyone who commented on this issue. – Julio Mendez Aug 08 '17 at 15:31
  • In that case you can post an answer to explain the solution. An open question like this is not really useful and may attract downvotes. I certainly can't see the reason, even with your explanation. – Vladimir F Героям слава Aug 08 '17 at 15:46
  • And, again, a [mcve] is the thing you should **really** read. A good MCVE does not reveal any copyrighted information, because it is very small and illustrates the issue on a small piece of code. – Vladimir F Героям слава Aug 08 '17 at 16:07
