
Final edit:

Some users on the Silverfrost forums very helpfully directed me to a simplification of the code and a solution.

The issue can be replicated using the following code:

      PROGRAM ML14ERROR   
        INTEGER :: origzn, destzn
        INTEGER,PARAMETER :: MXZMA = 1713, LXTZN = 1714, MXAV = 182
        INTEGER,PARAMETER :: JTMPREL = 1003, av = 1
        REAL(KIND=2) :: RANDOM@
        REAL,dimension (1:mxav,lxtzn,lxtzn,JTMPREL:JTMPREL):: znzndaav

        DO origzn=1,lxtzn
          DO destzn=1,lxtzn
            znzndaav(av,origzn,destzn,JTMPREL) = RANDOM@()
          END DO
        END DO

        DO origzn=1,mxzma
          DO destzn=1,mxzma
            ! This is where the error occurs
            znzndaav(av,origzn,lxtzn,JTMPREL)=
     $         znzndaav(av,origzn,lxtzn,JTMPREL)+
     $         znzndaav(av,origzn,destzn,JTMPREL)

          ENDDO
        ENDDO

        WRITE(6,*)'No errors'
      END PROGRAM 

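The issue only arises when MXAV > 182, which suggests a memory limit. Indeed, multiplying out the dimensions, 183 * 1714 * 1714 elements at 4 bytes each gives 2,150,466,672 bytes, just over 2 GiB (2,147,483,648 bytes), exceeding the stack size. With MXAV = 182 the array is 2,138,715,488 bytes, which still fits.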
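The arithmetic above can be checked with a short standard-Fortran sketch. It assumes the default REAL occupies 4 bytes (true for FTN95 and most compilers, but not guaranteed by the standard) and uses 64-bit integers so the byte count itself cannot overflow; the program name is illustrative only.

```fortran
program array_size_check
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer(int64), parameter :: lxtzn = 1714_int64
  integer(int64), parameter :: real_bytes = 4_int64      ! assumption: 4-byte default REAL
  integer(int64), parameter :: two_gib = 2_int64**31     ! 2,147,483,648 bytes
  integer(int64) :: mxav, nbytes

  ! Compare the array size just below and just at the failing value of MXAV
  do mxav = 182_int64, 183_int64
    nbytes = mxav * lxtzn * lxtzn * real_bytes
    write(*,'(a,i0,a,i0,a,l1)') 'MXAV=', mxav, '  bytes=', nbytes, &
                                '  exceeds 2 GiB: ', nbytes > two_gib
  end do
end program array_size_check
```

Only the MXAV = 183 case crosses the 2 GiB boundary, which matches the threshold at which the access violation first appears.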

A solution would be to use the heap as follows (Fortran 95):

PROGRAM ML14ERROR    
  INTEGER :: origzn, destzn
  INTEGER,PARAMETER :: MXZMA = 1713, LXTZN = 1714, MXAV = 191
  INTEGER,PARAMETER :: JTMPREL = 1003, av = 1
  REAL(KIND=2) :: RANDOM@
  REAL,allocatable :: znzndaav(:,:,:,:)

  ALLOCATE( znzndaav(1:mxav,lxtzn,lxtzn,JTMPREL:JTMPREL) )
  DO origzn=1,lxtzn
    DO destzn=1,lxtzn
      znzndaav(av,origzn,destzn,JTMPREL) = RANDOM@()
    END DO
  END DO

  DO origzn=1,mxzma
    DO destzn=1,mxzma
      ! This is where the error occurred with the statically declared array
      znzndaav(av,origzn,lxtzn,JTMPREL)= &
  &         znzndaav(av,origzn,lxtzn,JTMPREL)+ &
  &         znzndaav(av,origzn,destzn,JTMPREL)

    ENDDO
  ENDDO
  DEALLOCATE(znzndaav)
  
  WRITE(6,*)'No errors'
END PROGRAM

With this change we can allocate more than 2 GB and the array works fine. The program this snippet comes from is a few years old, and we have only now hit the issue because our latest model is many times larger than any before. As Fortran 77 does not support ALLOCATABLE arrays, we must reduce stack usage, port the code, or find another optimisation.
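If porting is the route taken, one wrinkle is that an ALLOCATABLE array cannot be placed in a COMMON block, so /DDMON/ itself would need replacing. A minimal sketch of one possible approach is a module holding the array plus an initialisation routine that checks the allocation status. The module name `ddmon_data` and the routine `ddmon_init` are hypothetical, and the sizes are deliberately tiny for illustration.

```fortran
module ddmon_data
  implicit none
  ! Hypothetical stand-in for COMMON /DDMON/; array name mirrors relocmon.inc
  integer, parameter :: jtmprel = 1003
  real, allocatable :: znzndaav(:,:,:,:)
contains
  subroutine ddmon_init(mxav, lxtzn, ierr)
    integer, intent(in)  :: mxav, lxtzn
    integer, intent(out) :: ierr
    ! STAT= lets the caller handle a failed allocation instead of crashing
    allocate(znzndaav(mxav, lxtzn, lxtzn, jtmprel:jtmprel), stat=ierr)
    if (ierr == 0) znzndaav = 0.0
  end subroutine ddmon_init
end module ddmon_data

program use_ddmon
  use ddmon_data
  implicit none
  integer :: ierr

  ! Small illustrative sizes; the real model would pass mxav = 191, lxtzn = 1714
  call ddmon_init(4, 8, ierr)
  if (ierr /= 0) then
    write(*,*) 'Allocation failed, stat = ', ierr
    stop 1
  end if
  write(*,*) 'Allocated elements: ', size(znzndaav)
  deallocate(znzndaav)
end program use_ddmon
```

Any routine that previously did `INCLUDE 'relocmon.inc'` would instead `USE` the module, so the array stays shared without living in static storage.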


Edited to add:

I have now put together a git repo which contains reproducible code.


Overview

I have a program that works fine when compiled to 32-bit, but presents an access violation error when compiled and run in 64-bit.

I'm using the Silverfrost Fortran compiler, FTN95 v8.51, though the issue also occurs with v8.40 and v8.50.


Sample code

! .\relocmon.inc
      INTEGER JTMPREL
      PARAMETER(JTMPREL=1003)
      REAL znda(lxtzn,JTMPREL:JTMPREL)
      REAL zndaav(1:mxav,lxtzn,JTMPREL:JTMPREL)      
      REAL,dimension (lxtzn,lxtzn,JTMPREL:JTMPREL) :: znznda
      REAL mlrlsum(lxtzn,lxtzn)

      REAL,dimension (1:mxav,lxtzn,lxtzn,JTMPREL:JTMPREL):: znzndaav

      COMMON /DDMON/ znda, znznda, mlrlsum,znzndaav, zndaav
! EOF .\relocmon.inc

! .\relocmon.inc with values
      INTEGER JTMPREL
      PARAMETER(JTMPREL=1003)
      REAL znda(1714,JTMPREL:JTMPREL)
      REAL zndaav(1:191,1714,JTMPREL:JTMPREL)      
      REAL,dimension (1714,1714,JTMPREL:JTMPREL) :: znznda
      REAL mlrlsum(1714,1714)

      REAL,dimension (1:191,1714,1714,JTMPREL:JTMPREL):: znzndaav

      COMMON /DDMON/ znda, znznda, mlrlsum,znzndaav, zndaav
! EOF .\relocmon.inc

! .\main.for
        INCLUDE 'relocmon.inc'
        
        REAL,save,dimension(lxtzn,lxtzn,mxav) :: ddfuncval
        
        DO origzn=1,mxzma
          IF( zonedef(origzn,JZUSE) )THEN
            DO destzn=1,mxzma
              IF (zonedef(destzn,JZUSE)) THEN
                znznda(origzn,destzn,JTMPREL)=znda(destzn,JTMPREL)*
     $                                       ddfuncval(origzn,destzn,av)            

               znznda(origzn,lxtzn,JTMPREL)=znznda(origzn,lxtzn,JTMPREL)
     $               +znznda(origzn,destzn,JTMPREL)
     
         znzndaav(av,origzn,destzn,JTMPREL)=zndaav(av,destzn,JTMPREL)*
     $                                    ddfuncval(origzn,destzn,av)           

         ! LINE 309 -- where error occurs
         znzndaav(av,origzn,lxtzn,JTMPREL)=
     $               znzndaav(av,origzn,lxtzn,JTMPREL)
     $             +znzndaav(av,origzn,destzn,JTMPREL)
     
              ENDIF
            ENDDO
          ENDIF
        ENDDO

! EOF .\main.for

NB the function zonedef simply checks that a zone is valid for the calculation we want to undertake. This function returns a logical.


Debugging

As I mentioned initially, the 32-bit compiled version of this program works fine. When attempting to run the 64-bit version, the output of the first loop is this:

from sdbg64.exe:

Error: Access Violation reading address
0x00000002071E05A0

main.for: 309

write exception to file:

Access violation (c0000005) at address 43a1f4

Within file ml14.exe
in main in line 309, at address 2b84

RAX = 0000000000000001   RBX = 000000027fff704c   RCX = 000000000285e6b8   RDX = 00000002802296cc
RBP = 0000000000400000   RSI = 000000029ba3ad6c   RDI = 0000000307695374   RSP = 000000000285be70
R8  = 0000000307695374   R9  = 00000002ffff5040   R10 = 000000029ba3ad6c   R11 = 000000030731f0dc
R12 = 000000027fff5584   R13 = 00000002802296cc   R14 = 000000028169f3ec   R15 = 0000000281660928

43a1f4) addss       XMM11,[85b401b4++R14]

For the rest of this, please bear with me. I'm not a trained software engineer or Fortran developer by any stretch, so I'm stabbing in the dark a little to troubleshoot.

The value for ZNZNDAAV(1,337,337,1003) is 2.241640, and this is being added to ZNZNDAAV(1,337,1714,1003). This tallies with register XMM11 as detailed in the exception output. This value is at address 000000029BA3BD60. The other value is at address 00000003071E05A0.

IIUC, in relocmon.inc we're setting COMMON /DDMON/ to contain the dimensioned array znzndaav, so if the software were working nominally, the address of the value in question would be within the /DDMON/ block. The address range for /DDMON/ is z'000000027FFF6040' - z'0000000307421150'. If my logic is correct, the violation occurs outside of this block.
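As a sanity check on that reasoning, the size of /DDMON/ can be worked out from the declared dimensions and compared with the address range reported by the debugger. The sketch below assumes 4-byte REALs, no padding between the members of the COMMON block, and the "with values" dimensions (lxtzn = 1714, mxav = 191); the program name is illustrative.

```fortran
program ddmon_size
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer(int64), parameter :: lxtzn = 1714_int64, mxav = 191_int64
  integer(int64), parameter :: real_bytes = 4_int64   ! assumption: 4-byte default REAL
  integer(int64), parameter :: block_lo = int(z'000000027FFF6040', int64)
  integer(int64), parameter :: block_hi = int(z'0000000307421150', int64)
  integer(int64), parameter :: faulting = int(z'00000002071E05A0', int64)
  integer(int64) :: n_elems, nbytes

  ! znda + znznda + mlrlsum + znzndaav + zndaav, in COMMON order
  n_elems = lxtzn + lxtzn*lxtzn + lxtzn*lxtzn + mxav*lxtzn*lxtzn + mxav*lxtzn
  nbytes  = real_bytes * n_elems

  write(*,'(a,i0)') 'bytes from declarations : ', nbytes
  write(*,'(a,i0)') 'bytes from address range: ', block_hi - block_lo
  write(*,'(a,l1)') 'faulting address inside /DDMON/: ', &
                    faulting >= block_lo .and. faulting <= block_hi
end program ddmon_size
```

Under these assumptions both byte counts come to 2,269,294,864, so the reported range does correspond to the whole COMMON block, and the faulting address lies below its start.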

It appears to me that the program is attempting to read from 00000002071E05A0 (the debugger reports a read, not a write) when it should be using 00000003071E05A0.
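The expected address can also be derived independently from the start of /DDMON/ and Fortran's column-major layout. This is a sketch under the same assumptions as before (4-byte REALs, members laid out contiguously in COMMON order, lxtzn = 1714, mxav = 191), for the element ZNZNDAAV(1,337,1714,1003):

```fortran
program expected_address
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer(int64), parameter :: lxtzn = 1714_int64, mxav = 191_int64
  integer(int64), parameter :: real_bytes = 4_int64   ! assumption: 4-byte default REAL
  integer(int64), parameter :: block_lo = int(z'000000027FFF6040', int64)
  integer(int64), parameter :: faulting = int(z'00000002071E05A0', int64)
  integer(int64), parameter :: av = 1_int64, origzn = 337_int64, destzn = 1714_int64
  integer(int64) :: before, offset, expected

  ! Elements stored ahead of znzndaav in /DDMON/: znda, znznda, mlrlsum
  before = lxtzn + lxtzn*lxtzn + lxtzn*lxtzn
  ! Column-major element offset of znzndaav(av,origzn,destzn,JTMPREL);
  ! the last dimension has extent 1, so it contributes nothing
  offset = (av - 1) + mxav*((origzn - 1) + lxtzn*(destzn - 1))
  expected = block_lo + real_bytes*(before + offset)

  write(*,'(a,z16.16)') 'expected address  : ', expected
  write(*,'(a,z16.16)') 'faulting address  : ', faulting
  write(*,'(a,z0)')     'difference (hex)  : ', expected - faulting
  write(*,'(a,l1)')     'difference = 2**32: ', expected - faulting == 2_int64**32
end program expected_address
```

The computed address is 00000003071E05A0, and it differs from the faulting address by exactly 2**32. That would be consistent with a byte offset larger than 2 GiB being handled somewhere as a 32-bit quantity, though this is only a hypothesis and not something the debugger output proves.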

Can anyone help me determine why this would be the case? There appears to be something systematic about it - could it be mere coincidence?

  • I tried to reproduce and failed, i.e. no crash for me. However, I tried on Linux 64-bit using GFortran. You might want to make the repo easier to compile, to get better responses. Since I do not use whatever IDE you use, I had to go digging into the makefiles to figure out what to compile and how. Hint: Do not use compiler-dependent KIND-numbers. – rtoijala Aug 01 '19 at 17:23
  • Thanks, rtoijala. I'm very new to this: the example I put together was too complex for the issue at hand - had I have reduced it down, I would have seen the root cause of the error very quickly. Please see latest edit to question. – ahalls Aug 02 '19 at 14:03
