I'm trying to test the effectiveness of a manual cache blocking (loop tiling) optimization applied to a routine in a Fortran scientific code. For tile size selection I used an algorithm based on classical Distinct Lines Estimation. I am using the Intel Fortran Compiler, ifort 18.0.1 (2018).
The code is compiled with the -O3 -xHost flags. To observe any speedup between the base version and the tiled version I have to set the prefetching level to 2 (with -qopt-prefetch=2). With that setting I get a 27% speedup (24 seconds versus 33 seconds). With plain -O3 -xHost the execution time does not improve: both the base and the tiled version take 20 seconds.
A simple loop nest is the following. The base version:
DO jk = 2, jpkm1   ! Interior value ( multiplied by wmask)
   DO jj = 1, jpj
      DO ji = 1, jpi
         zfp_wk = pwn(ji,jj,jk) + ABS( pwn(ji,jj,jk) )
         zfm_wk = pwn(ji,jj,jk) - ABS( pwn(ji,jj,jk) )
         zwz(ji,jj,jk) = 0.5 * ( zfp_wk * ptb(ji,jj,jk,jn) + zfm_wk * ptb(ji,jj,jk-1,jn) ) * wmask(ji,jj,jk)
      END DO
   END DO
END DO
and the optimized version:
DO jltj = 1, jpj, OBS_UPSTRFLX_TILEY
   DO jk = 2, jpkm1
      DO jj = jltj, MIN(jpj, jltj+OBS_UPSTRFLX_TILEY-1)
         DO ji = 1, jpi
            zfp_wk = pwn(ji,jj,jk) + ABS( pwn(ji,jj,jk) )
            zfm_wk = pwn(ji,jj,jk) - ABS( pwn(ji,jj,jk) )
            zwz(ji,jj,jk) = 0.5 * ( zfp_wk * ptb(ji,jj,jk,jn) + zfm_wk * ptb(ji,jj,jk-1,jn) ) * wmask(ji,jj,jk)
         END DO
      END DO
   END DO
END DO
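For anyone who wants to reproduce the comparison, here is a minimal self-contained sketch that runs both loop nests on synthetic data, times them with SYSTEM_CLOCK, and checks that they produce the same zwz. The array sizes, the random input data and the tile size (the hypothetical constant tiley, standing in for OBS_UPSTRFLX_TILEY) are illustrative assumptions, not the values of the real run; the sizes would have to be increased until the working set exceeds the cache before the timings mean anything.

```fortran
PROGRAM tile_check
   IMPLICIT NONE
   ! All sizes and the tile size are illustrative assumptions.
   INTEGER, PARAMETER :: wp = KIND(1.0d0)
   INTEGER, PARAMETER :: i8 = SELECTED_INT_KIND(18)
   INTEGER, PARAMETER :: jpi = 128, jpj = 96, jpk = 32, jpkm1 = jpk - 1
   INTEGER, PARAMETER :: jn = 1
   INTEGER, PARAMETER :: tiley = 8   ! stands in for OBS_UPSTRFLX_TILEY
   REAL(wp), ALLOCATABLE :: pwn(:,:,:), wmask(:,:,:), ptb(:,:,:,:)
   REAL(wp), ALLOCATABLE :: zwz_base(:,:,:), zwz_tile(:,:,:)
   REAL(wp) :: zfp_wk, zfm_wk
   INTEGER :: ji, jj, jk, jltj
   INTEGER(i8) :: t0, t1, t2, rate

   ALLOCATE( pwn(jpi,jpj,jpk), wmask(jpi,jpj,jpk), ptb(jpi,jpj,jpk,jn) )
   ALLOCATE( zwz_base(jpi,jpj,jpk), zwz_tile(jpi,jpj,jpk) )

   ! Synthetic input: pwn takes both signs so both branches contribute.
   CALL RANDOM_NUMBER(pwn)
   pwn = pwn - 0.5_wp
   CALL RANDOM_NUMBER(ptb)
   wmask = 1.0_wp
   zwz_base = 0.0_wp
   zwz_tile = 0.0_wp

   CALL SYSTEM_CLOCK(t0, rate)

   ! Base version
   DO jk = 2, jpkm1
      DO jj = 1, jpj
         DO ji = 1, jpi
            zfp_wk = pwn(ji,jj,jk) + ABS( pwn(ji,jj,jk) )
            zfm_wk = pwn(ji,jj,jk) - ABS( pwn(ji,jj,jk) )
            zwz_base(ji,jj,jk) = 0.5_wp * ( zfp_wk * ptb(ji,jj,jk,jn) &
               &               + zfm_wk * ptb(ji,jj,jk-1,jn) ) * wmask(ji,jj,jk)
         END DO
      END DO
   END DO

   CALL SYSTEM_CLOCK(t1)

   ! Tiled version: jj is blocked in chunks of tiley rows
   DO jltj = 1, jpj, tiley
      DO jk = 2, jpkm1
         DO jj = jltj, MIN(jpj, jltj+tiley-1)
            DO ji = 1, jpi
               zfp_wk = pwn(ji,jj,jk) + ABS( pwn(ji,jj,jk) )
               zfm_wk = pwn(ji,jj,jk) - ABS( pwn(ji,jj,jk) )
               zwz_tile(ji,jj,jk) = 0.5_wp * ( zfp_wk * ptb(ji,jj,jk,jn) &
                  &               + zfm_wk * ptb(ji,jj,jk-1,jn) ) * wmask(ji,jj,jk)
            END DO
         END DO
      END DO
   END DO

   CALL SYSTEM_CLOCK(t2)

   PRINT *, 'base  [s]:', REAL(t1 - t0, wp) / REAL(rate, wp)
   PRINT *, 'tiled [s]:', REAL(t2 - t1, wp) / REAL(rate, wp)

   ! Small tolerance in case the two nests are vectorized differently.
   IF ( MAXVAL( ABS( zwz_base - zwz_tile ) ) > 1.0e-12_wp ) THEN
      ERROR STOP 'tiled result differs from base'
   END IF
   PRINT *, 'results agree'
END PROGRAM tile_check
```

Building this once with -O3 -xHost and once with -O3 -xHost -qopt-prefetch=2 should allow the same base-versus-tiled comparison outside the full application.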
Why can't I observe any speedup with the plain -O3 -xHost run? I suspect the cause is the aggressive software prefetching enabled at -O3 (which I believe corresponds to -qopt-prefetch=3), but I would like to know whether I can optimize further with cache blocking. I have tried some hand-written software prefetching like this:
DO jltj = 1, jpj, OBS_UPSTRFLX_TILEY
   DO jk = 2, jpkm1
      DO jj = jltj, MIN(jpj, jltj+OBS_UPSTRFLX_TILEY-1)
         DO ji = 1, jpi
            zfp_wk = pwn(ji,jj,jk) + ABS( pwn(ji,jj,jk) )
            zfm_wk = pwn(ji,jj,jk) - ABS( pwn(ji,jj,jk) )
            zwz(ji,jj,jk) = 0.5 * ( zfp_wk * ptb(ji,jj,jk,jn) + zfm_wk * ptb(ji,jj,jk-1,jn) ) * wmask(ji,jj,jk)
            IF(jk== jpkm1 .AND. jj == MIN(jpj, jltj+OBS_UPSTRFLX_TILEY-1)-2) THEN
               CALL mm_prefetch(pwn(1,jltj+OBS_UPSTRFLX_TILEY,1), 1)
               CALL mm_prefetch(zwz(1,jltj+OBS_UPSTRFLX_TILEY,1), 1)
               CALL mm_prefetch(ptb(1,jltj+OBS_UPSTRFLX_TILEY,1,jn), 1)
               CALL mm_prefetch(wmask(1,jltj+OBS_UPSTRFLX_TILEY,1), 1)
            ENDIF
         END DO
      END DO
   END DO
END DO
but this doesn't seem to help. Any suggestions would be greatly appreciated.
Best regards.