Cache Blocking and Prefetching

Question

I'm trying to test the effectiveness of a manual cache blocking (loop tiling) optimization applied to a routine in a Fortran scientific code. For tile size selection, I used an algorithm based on classical Distinct Lines Estimation. I am using the Intel Fortran Compiler, ifort 18.0.1 (2018).

The code is compiled with the -O3 -xHost flags. To observe any speedup between the base version and the tiled version, I have to switch the prefetching level to 2 (using -qopt-prefetch=2). With that setting I obtain a 27% speedup (24 seconds versus 33 seconds). With plain -O3 -xHost the execution time does not improve (20 seconds), so I see no difference between the base and tiled versions.

A simple loop nest illustrates this. The base version:

DO jk = 2, jpkm1        ! Interior value ( multiplied by wmask)
    DO jj = 1, jpj
        DO ji = 1, jpi
            zfp_wk = pwn(ji,jj,jk) + ABS( pwn(ji,jj,jk) )
            zfm_wk = pwn(ji,jj,jk) - ABS( pwn(ji,jj,jk) )
            zwz(ji,jj,jk) = 0.5 * ( zfp_wk * ptb(ji,jj,jk,jn) + zfm_wk * ptb(ji,jj,jk-1,jn) ) * wmask(ji,jj,jk)
        END DO
    END DO
END DO

and the optimized version:

DO jltj = 1, jpj, OBS_UPSTRFLX_TILEY    
    DO jk = 2, jpkm1
        DO jj = jltj, MIN(jpj, jltj+OBS_UPSTRFLX_TILEY-1)
            DO ji = 1, jpi
                zfp_wk = pwn(ji,jj,jk) + ABS( pwn(ji,jj,jk) )
                zfm_wk = pwn(ji,jj,jk) - ABS( pwn(ji,jj,jk) )
                zwz(ji,jj,jk) = 0.5 * ( zfp_wk * ptb(ji,jj,jk,jn) + zfm_wk * ptb(ji,jj,jk-1,jn) ) * wmask(ji,jj,jk)
            END DO
        END DO  
    END DO  
END DO
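
For reference, the only temporal reuse in this nest is the `ptb(ji,jj,jk-1,jn)` read, which revisits the slab loaded one `jk` iteration earlier. Tiling over `jj` shortens the distance between those two touches from a full `jpi x jpj` plane to a `jpi x TILEY` slab. The following is a minimal sketch of a footprint-based upper bound on the tile height. It is a simplified estimate, not the Distinct Lines Estimation algorithm used for the actual tile size. The module name, the function name, and the count of five live slabs (`pwn`, `wmask`, `zwz`, and two levels of `ptb`) are assumptions made for illustration.

```fortran
module tile_footprint
  implicit none
contains
  ! Largest tile height (in jj rows) whose per-jk working set fits in a
  ! cache of cbytes bytes. Hypothetical helper; all inputs are assumptions
  ! supplied by the caller, none are taken from the question.
  !   ni     : inner dimension (jpi)
  !   nj     : tiled dimension (jpj)
  !   ebytes : bytes per array element (e.g. 8 for double precision)
  !   cbytes : cache capacity to target, in bytes
  !   slabs  : number of ni x rows slabs assumed live at once
  pure integer function max_tile_rows(ni, nj, ebytes, cbytes, slabs) result(rows)
    integer, intent(in) :: ni, nj, ebytes, cbytes, slabs
    ! Integer division rounds down, so the result never overshoots the cache.
    rows = cbytes / (slabs * ni * ebytes)
    ! Clamp to a valid tile height: at least one row, at most the full extent.
    rows = max(1, min(nj, rows))
  end function max_tile_rows
end module tile_footprint
```

A program that uses the module could call, for example, `max_tile_rows(jpi, jpj, 8, 262144, 5)` to target a hypothetical 256 KiB cache with 8-byte elements. If the returned bound is already close to `jpj`, the untiled loop fits the same cache and tiling would not be expected to change much.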

Why can't I observe any speedup with the plain -O3 -xHost run? I suspect the cause is the aggressive software prefetching introduced by -O3 (which should correspond to -qopt-prefetch=3), but I would like to know whether cache blocking can still yield further gains. I have tried some handmade software prefetching like this:

DO jltj = 1, jpj, OBS_UPSTRFLX_TILEY
    DO jk = 2, jpkm1
        DO jj = jltj, MIN(jpj, jltj+OBS_UPSTRFLX_TILEY-1)
            DO ji = 1, jpi
                zfp_wk = pwn(ji,jj,jk) + ABS( pwn(ji,jj,jk) )
                zfm_wk = pwn(ji,jj,jk) - ABS( pwn(ji,jj,jk) )
                zwz(ji,jj,jk) = 0.5 * ( zfp_wk * ptb(ji,jj,jk,jn) + zfm_wk * ptb(ji,jj,jk-1,jn) ) * wmask(ji,jj,jk)
                IF (jk == jpkm1 .AND. jj == MIN(jpj, jltj+OBS_UPSTRFLX_TILEY-1)-2) THEN
                    CALL mm_prefetch(pwn(1,jltj+OBS_UPSTRFLX_TILEY,1), 1)
                    CALL mm_prefetch(zwz(1,jltj+OBS_UPSTRFLX_TILEY,1), 1)
                    CALL mm_prefetch(ptb(1,jltj+OBS_UPSTRFLX_TILEY,1,jn), 1)
                    CALL mm_prefetch(wmask(1,jltj+OBS_UPSTRFLX_TILEY,1), 1)
                ENDIF
            END DO
        END DO
    END DO
END DO

but this does not seem to help. Any suggestions would be greatly appreciated.

Best regards.

What are the dimensions of the objects you are talking about (`pwn`, `ptb`, `zwz`, ...)? On top of that, `zfp_wk` is either `0` or `2*pwn(ji,jj,jk)`, and ditto for `zfm_wk`, which probably means that `zwz` can be computed using `WHERE`. — kvantour, Apr 26 '18 at 16:08
I don't think `where` can make anything faster, I would expect quite the opposite. It is often better to compute more, even if useless, and avoid branching. — Vladimir F Героям слава, Apr 26 '18 at 16:22
The dimensions of these objects are `jpi X jpj X jpk`. I calculated the tile size with Distinct Lines Estimation, based on the loop bounds, these dimensions, and the subscripts. However, with plain -O3 -xHost, varying the tile size has no effect. It seems that aggressive prefetching interferes with cache blocking, but I would like to know whether cache blocking can improve things further. — Marco Chiarelli, Apr 27 '18 at 09:22
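
The observation in the comments that `zfp_wk` is either `0` or `2*pwn` (and `zfm_wk` likewise) can be written without `WHERE` or any branch by using `MAX` and `MIN`. The sketch below places the form from the question next to that rewritten form so the two can be compared. The module name, the function names, and the double-precision kind `wp` are assumptions for illustration. Whether the rewritten form is any faster would have to be measured, since the loop may well be limited by memory traffic rather than arithmetic.

```fortran
module upstream_flux_forms
  implicit none
  integer, parameter :: wp = kind(1.0d0)   ! assumed working precision
contains
  ! Form used in the question: w + |w| is 2w for w > 0 and 0 otherwise,
  ! w - |w| is 2w for w < 0 and 0 otherwise.
  elemental function flux_abs(w, t_k, t_km1, mask) result(z)
    real(wp), intent(in) :: w, t_k, t_km1, mask
    real(wp) :: z, zfp, zfm
    zfp = w + abs(w)
    zfm = w - abs(w)
    z = 0.5_wp * (zfp * t_k + zfm * t_km1) * mask
  end function flux_abs

  ! Same quantity with the factor 2 and the 0.5 cancelled:
  ! max(w,0) selects the positive part, min(w,0) the negative part.
  elemental function flux_maxmin(w, t_k, t_km1, mask) result(z)
    real(wp), intent(in) :: w, t_k, t_km1, mask
    real(wp) :: z
    z = (max(w, 0.0_wp) * t_k + min(w, 0.0_wp) * t_km1) * mask
  end function flux_maxmin
end module upstream_flux_forms
```

Both functions are `elemental`, so they can be applied to whole array sections, for example `flux_maxmin(pwn(:,:,jk), ptb(:,:,jk,jn), ptb(:,:,jk-1,jn), wmask(:,:,jk))`, assuming the arrays have kind `wp`.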
