Final edit:
Some users on the silverfrost forums directed me very helpfully, to a simplification of the code and a solution.
The issue can be replicated using the following code:
PROGRAM ML14ERROR
INTEGER :: origzn, destzn
INTEGER,PARAMETER :: MXZMA = 1713, LXTZN = 1714, MXAV = 182
INTEGER,PARAMETER :: JTMPREL = 1003, av = 1
REAL(KIND=2) :: RANDOM@
REAL,dimension (1:mxav,lxtzn,lxtzn,JTMPREL:JTMPREL):: znzndaav
DO origzn=1,lxtzn
DO destzn=1,lxtzn
znzndaav(av,origzn,destzn,JTMPREL) = RANDOM@()
END DO
END DO
DO origzn=1,mxzma
DO destzn=1,mxzma
! This is where the error occurs
znzndaav(av,origzn,lxtzn,JTMPREL)=
$ znzndaav(av,origzn,lxtzn,JTMPREL)+
$ znzndaav(av,origzn,destzn,JTMPREL)
ENDDO
ENDDO
WRITE(6,*)'No errors'
END PROGRAM
The issue only arises when MXAV
>182, which suggests a memory issue. Indeed, multiplying out the dimensions: 183 * 1714 * 1714 * 4 yields >2GB, exceeding the stack size.
A solution would be to use the heap as follows (Fortan 95):
PROGRAM ML14ERROR
INTEGER :: origzn, destzn
INTEGER,PARAMETER :: MXZMA = 1713, LXTZN = 1714, MXAV = 191
INTEGER,PARAMETER :: JTMPREL = 1003, av = 1
REAL(KIND=2) :: RANDOM@
REAL,allocatable :: znzndaav(:,:,:,:)
ALLOCATE( znzndaav(1:mxav,lxtzn,lxtzn,JTMPREL:JTMPREL) )
DO origzn=1,lxtzn
DO destzn=1,lxtzn
znzndaav(av,origzn,destzn,JTMPREL) = RANDOM@()
END DO
END DO
DO origzn=1,mxzma
DO destzn=1,mxzma
! This is where the error occurs
znzndaav(av,origzn,lxtzn,JTMPREL)= &
& znzndaav(av,origzn,lxtzn,JTMPREL)+ &
& znzndaav(av,origzn,destzn,JTMPREL)
ENDDO
ENDDO
DEALLOCATE(znzndaav)
WRITE(6,*)'No errors'
END PROGRAM
Once we do this, we can allocate more than 2GB and the array works fine. The program this small section of code stems from is a few years old, and we've only just now run into the issue because a model we've built is many times larger than any before. As Fortran 77 doesn't allow ALLOCATABLE arrays, we must either reduce stack usage, or port the code - or seek another optimisation.
Edited to add:
I have now put together a git repo which contains reproducible code.
Overview
I have a program that works fine when compiled to 32-bit, but presents an access violation error when compiled and run in 64-bit.
I'm using the Silverfrost Fortran compiler, FTN95 v8.51, though this issue occurs using v8.40 and v8.50.
Sample code
! .\relocmon.inc
INTEGER JTMPREL
PARAMETER(JTMPREL=1003)
REAL znda(lxtzn,JTMPREL:JTMPREL)
REAL zndaav(1:mxav,lxtzn,JTMPREL:JTMPREL)
REAL,dimension (lxtzn,lxtzn,JTMPREL:JTMPREL) :: znznda
REAL mlrlsum(lxtzn,lxtzn)
REAL,dimension (1:mxav,lxtzn,lxtzn,JTMPREL:JTMPREL):: znzndaav
COMMON /DDMON/ znda, znznda, mlrlsum,znzndaav, zndaav
! EOF .\relocmon.inc
! .\relocmon.inc with values
INTEGER JTMPREL
PARAMETER(JTMPREL=1003)
REAL znda(1714,JTMPREL:JTMPREL)
REAL zndaav(1:191,1714,JTMPREL:JTMPREL)
REAL,dimension (1714,1714,JTMPREL:JTMPREL) :: znznda
REAL mlrlsum(1714,1714)
REAL,dimension (1:191,1714,1714,JTMPREL:JTMPREL):: znzndaav
COMMON /DDMON/ znda, znznda, mlrlsum,znzndaav, zndaav
! EOF .\relocmon.inc
! .\main.for
INCLUDE 'relocmon.inc'
REAL,save,dimension(lxtzn,lxtzn,mxav) :: ddfuncval
DO origzn=1,mxzma
IF( zonedef(origzn,JZUSE) )THEN
DO destzn=1,mxzma
IF (zonedef(destzn,JZUSE)) THEN
znznda(origzn,destzn,JTMPREL)=znda(destzn,JTMPREL)*
$ ddfuncval(origzn,destzn,av)
znznda(origzn,lxtzn,JTMPREL)=znznda(origzn,lxtzn,JTMPREL)
$ +znznda(origzn,destzn,JTMPREL)
znzndaav(av,origzn,destzn,JTMPREL)=zndaav(av,destzn,JTMPREL)*
$ ddfuncval(origzn,destzn,av)
! LINE 309 -- where error occurs
znzndaav(av,origzn,lxtzn,JTMPREL)=
$ znzndaav(av,origzn,lxtzn,JTMPREL)
$ +znzndaav(av,origzn,destzn,JTMPREL)
ENDIF
ENDDO
ENDIF
ENDDO
! EOF .\main.for
NB the function zonedef
simply checks that a zone is valid for the calculation we want to undertake. This function returns a logical
.
Debugging
As I mentioned initially, the 32-bit compiled version of this program works fine. When attempting to run the 64-bit version, the output of the first loop is this:
from sdbg64.exe:
Error: Access Violation reading address
0x00000002071E05A0
main.for: 309
write exception to file:
Access violation (c0000005) at address 43a1f4
Within file ml14.exe
in main in line 309, at address 2b84
RAX = 0000000000000001 RBX = 000000027fff704c RCX = 000000000285e6b8 RDX = 00000002802296cc
RBP = 0000000000400000 RSI = 000000029ba3ad6c RDI = 0000000307695374 RSP = 000000000285be70
R8 = 0000000307695374 R9 = 00000002ffff5040 R10 = 000000029ba3ad6c R11 = 000000030731f0dc
R12 = 000000027fff5584 R13 = 00000002802296cc R14 = 000000028169f3ec R15 = 0000000281660928
43a1f4) addss XMM11,[85b401b4++R14]
For the rest of this... please bear with me. I'm not a trained software engineer or fortran developer by any stretch, so I'm stabbing in the dark a little to troubleshoot.
The value for ZNZNDAAV(1,337,337,1003)
is 2.241640, and this is being added to ZNZNDAAV(1,337,1714,1003)
. This tallies with register XMM11 as detailed in the exception output. This value is at address 000000029BA3BD60
. The other value is at address 00000003071E05A0
.
IIUC, in relocmon.inc we're setting COMMON /DDMON/
to contain the dimensioned array znzndaav
, so if the software were working nominally, the address of the value in question would be within the /DDMON/
block. The address range for /DDMON/
is z'000000027FFF6040' - z'0000000307421150'
. If my logic is correct, the violation occurs outside of this block.
It appears to me that the program is attempting to write to 00000002071E05A0
when it should be using 00000003071E05A0
.
Can anyone help me determine why this would be the case? There appears to be something systematic about it - could it be mere coincidence?