I'm trying to test the effectiveness of a manual cache blocking (loop tiling) optimization applied to a routine in some Fortran scientific code. For tile size selection, I used an algorithm based on the classical Distinct Lines Estimation. I am using the Intel Fortran Compiler, ifort 13.0.0 (2012).
To observe any execution-time speed-up, I have to switch to the -O2 optimization flag: there IS a 10% speed-up between the -O2 code WITH manual cache blocking and the -O2 code without it. If I set -O3 or -O3 -xHost, the execution time does not improve (it is more or less equal to that of the base code without manual cache blocking, compiled with -O3 -xHost).
Note that vectorization is present only with the -O3 -xHost compiler flags, but even with -O3 alone I still can't observe any speed-up. So the question is:
Which optimization(s) are actually interfering with the manual cache blocking at -O3?
Here is the Intel HLO (High Level Optimizer) report from an -O3-only compilation of the manually tiled code:
HLO REPORT LOG OPENED ON Mon Mar 5 10:41:19 2018
</users/home/mc28217/dev_HPC_Gyre_benchmark_test_trunk_2/NEMOGCM/CONFIG/GYRE_BENCHMARK_BLKD/BLD/ppsrc/nemo/traadv_fct.f90;-1:-1;hlo;traadv_fct_mp_tra_adv_fct_;0>
High Level Optimizer Report (traadv_fct_mp_tra_adv_fct_)
Unknown loop at line #346
Perfect Nest of depth 2 at line 226
Perfect Nest of depth 2 at line 232
Perfect Nest of depth 2 at line 251
Perfect Nest of depth 2 at line 251
Perfect Nest of depth 2 at line 254
Perfect Nest of depth 2 at line 254
Perfect Nest of depth 2 at line 254
Perfect Nest of depth 2 at line 254
Perfect Nest of depth 2 at line 254
Perfect Nest of depth 2 at line 254
Perfect Nest of depth 2 at line 257
Perfect Nest of depth 2 at line 257
Perfect Nest of depth 2 at line 276
Perfect Nest of depth 2 at line 277
Perfect Nest of depth 2 at line 296
Perfect Nest of depth 2 at line 296
Perfect Nest of depth 2 at line 296
Perfect Nest of depth 2 at line 296
Perfect Nest of depth 2 at line 313
Perfect Nest of depth 2 at line 314
Perfect Nest of depth 2 at line 325
Perfect Nest of depth 2 at line 325
Perfect Nest of depth 2 at line 325
Perfect Nest of depth 2 at line 325
Perfect Nest of depth 2 at line 361
Perfect Nest of depth 3 at line 361
Perfect Nest of depth 2 at line 361
Adjacent Loops: 3 at line 361
Perfect Nest of depth 2 at line 361
Perfect Nest of depth 3 at line 361
Perfect Nest of depth 2 at line 361
Perfect Nest of depth 2 at line 374
Perfect Nest of depth 2 at line 377
Perfect Nest of depth 2 at line 377
Perfect Nest of depth 2 at line 377
Perfect Nest of depth 2 at line 377
Perfect Nest of depth 2 at line 378
Perfect Nest of depth 2 at line 378
Perfect Nest of depth 2 at line 382
Perfect Nest of depth 2 at line 382
Perfect Nest of depth 2 at line 382
Perfect Nest of depth 2 at line 382
Perfect Nest of depth 2 at line 382
Perfect Nest of depth 2 at line 382
Perfect Nest of depth 2 at line 382
Perfect Nest of depth 2 at line 400
Perfect Nest of depth 2 at line 400
Perfect Nest of depth 2 at line 401
Perfect Nest of depth 2 at line 401
Perfect Nest of depth 2 at line 402
Perfect Nest of depth 2 at line 402
Perfect Nest of depth 2 at line 406
Perfect Nest of depth 2 at line 407
Perfect Nest of depth 2 at line 408
Perfect Nest of depth 2 at line 412
Perfect Nest of depth 2 at line 412
Perfect Nest of depth 2 at line 416
Perfect Nest of depth 2 at line 416
Perfect Nest of depth 2 at line 417
QLOOPS 246/246/0 ENODE LOOPS 246 unknown 1 multi_exit_do 0 do 245 linear_do 233 lite_throttled 0
LINEAR HLO EXPRESSIONS: 1900 / 5384 + LINEAR(innermost): 1628 / 5384
------------------------------------------------------------------------------
</users/home/mc28217/dev_HPC_Gyre_benchmark_test_trunk_2/NEMOGCM/CONFIG/GYRE_BENCHMARK_BLKD/BLD/ppsrc/nemo/traadv_fct.f90;200:200;hlo_scalar_replacement;in traadv_fct_mp_tra_adv_fct_;0>
#of Array Refs Scalar Replaced in traadv_fct_mp_tra_adv_fct_ at line 200=9
</users/home/mc28217/dev_HPC_Gyre_benchmark_test_trunk_2/NEMOGCM/CONFIG/GYRE_BENCHMARK_BLKD/BLD/ppsrc/nemo/traadv_fct.f90;216:216;hlo_scalar_replacement;in traadv_fct_mp_tra_adv_fct_;0>
#of Array Refs Scalar Replaced in traadv_fct_mp_tra_adv_fct_ at line 216=4
#of Array Refs Scalar Replaced in traadv_fct_mp_tra_adv_fct_ at line 216=1
</users/home/mc28217/dev_HPC_Gyre_benchmark_test_trunk_2/NEMOGCM/CONFIG/GYRE_BENCHMARK_BLKD/BLD/ppsrc/nemo/traadv_fct.f90;239:239;hlo_scalar_replacement;in traadv_fct_mp_tra_adv_fct_;0>
#of Array Refs Scalar Replaced in traadv_fct_mp_tra_adv_fct_ at line 239=1
</users/home/mc28217/dev_HPC_Gyre_benchmark_test_trunk_2/NEMOGCM/CONFIG/GYRE_BENCHMARK_BLKD/BLD/ppsrc/nemo/traadv_fct.f90;267:267;hlo_scalar_replacement;in traadv_fct_mp_tra_adv_fct_;0>
#of Array Refs Scalar Replaced in traadv_fct_mp_tra_adv_fct_ at line 267=1
</users/home/mc28217/dev_HPC_Gyre_benchmark_test_trunk_2/NEMOGCM/CONFIG/GYRE_BENCHMARK_BLKD/BLD/ppsrc/nemo/traadv_fct.f90;281:281;hlo_scalar_replacement;in traadv_fct_mp_tra_adv_fct_;0>
#of Array Refs Scalar Replaced in traadv_fct_mp_tra_adv_fct_ at line 281=3
</users/home/mc28217/dev_HPC_Gyre_benchmark_test_trunk_2/NEMOGCM/CONFIG/GYRE_BENCHMARK_BLKD/BLD/ppsrc/nemo/traadv_fct.f90;289:289;hlo_scalar_replacement;in traadv_fct_mp_tra_adv_fct_;0>
#of Array Refs Scalar Replaced in traadv_fct_mp_tra_adv_fct_ at line 289=1
</users/home/mc28217/dev_HPC_Gyre_benchmark_test_trunk_2/NEMOGCM/CONFIG/GYRE_BENCHMARK_BLKD/BLD/ppsrc/nemo/traadv_fct.f90;301:301;hlo_scalar_replacement;in traadv_fct_mp_tra_adv_fct_;0>
#of Array Refs Scalar Replaced in traadv_fct_mp_tra_adv_fct_ at line 301=1
</users/home/mc28217/dev_HPC_Gyre_benchmark_test_trunk_2/NEMOGCM/CONFIG/GYRE_BENCHMARK_BLKD/BLD/ppsrc/nemo/traadv_fct.f90;318:318;hlo_scalar_replacement;in traadv_fct_mp_tra_adv_fct_;0>
#of Array Refs Scalar Replaced in traadv_fct_mp_tra_adv_fct_ at line 318=3
</users/home/mc28217/dev_HPC_Gyre_benchmark_test_trunk_2/NEMOGCM/CONFIG/GYRE_BENCHMARK_BLKD/BLD/ppsrc/nemo/traadv_fct.f90;330:330;hlo_scalar_replacement;in traadv_fct_mp_tra_adv_fct_;0>
#of Array Refs Scalar Replaced in traadv_fct_mp_tra_adv_fct_ at line 330=3
</users/home/mc28217/dev_HPC_Gyre_benchmark_test_trunk_2/NEMOGCM/CONFIG/GYRE_BENCHMARK_BLKD/BLD/ppsrc/nemo/traadv_fct.f90;352:352;hlo_scalar_replacement;in traadv_fct_mp_tra_adv_fct_;0>
#of Array Refs Scalar Replaced in traadv_fct_mp_tra_adv_fct_ at line 352=1
#of Array Refs Scalar Replaced in traadv_fct_mp_tra_adv_fct_ at line 352=1
</users/home/mc28217/dev_HPC_Gyre_benchmark_test_trunk_2/NEMOGCM/CONFIG/GYRE_BENCHMARK_BLKD/BLD/ppsrc/nemo/traadv_fct.f90;361:361;hlo_distribution;in traadv_fct_mp_tra_adv_fct_;0>
LOOP DISTRIBUTION in traadv_fct_mp_tra_adv_fct_ at line 361
Estimate of max_trip_count of loop at line 361=12
Estimate of max_trip_count of loop at line 361=12
Estimate of max_trip_count of loop at line 361=12
Estimate of max_trip_count of loop at line 361=12
</users/home/mc28217/dev_HPC_Gyre_benchmark_test_trunk_2/NEMOGCM/CONFIG/GYRE_BENCHMARK_BLKD/BLD/ppsrc/nemo/traadv_fct.f90;365:365;hlo_scalar_replacement;in traadv_fct_mp_tra_adv_fct_;0>
#of Array Refs Scalar Replaced in traadv_fct_mp_tra_adv_fct_ at line 365=1
</users/home/mc28217/dev_HPC_Gyre_benchmark_test_trunk_2/NEMOGCM/CONFIG/GYRE_BENCHMARK_BLKD/BLD/ppsrc/nemo/traadv_fct.f90;389:389;hlo_scalar_replacement;in traadv_fct_mp_tra_adv_fct_;0>
#of Array Refs Scalar Replaced in traadv_fct_mp_tra_adv_fct_ at line 389=1
#of Array Refs Scalar Replaced in traadv_fct_mp_tra_adv_fct_ at line 389=1
Loop dual-path report:
</users/home/mc28217/dev_HPC_Gyre_benchmark_test_trunk_2/NEMOGCM/CONFIG/GYRE_BENCHMARK_BLKD/BLD/ppsrc/nemo/traadv_fct.f90;179:179;hlo;traadv_fct_mp_tra_adv_fct_;0>
Loop at 179 -- selected for multiversion- Assume shape array stride tests
Loop at 179 -- selected for multiversion- Assume shape array stride tests
Loop at 179 -- selected for multiversion- Assume shape array stride tests
</users/home/mc28217/dev_HPC_Gyre_benchmark_test_trunk_2/NEMOGCM/CONFIG/GYRE_BENCHMARK_BLKD/BLD/ppsrc/nemo/traadv_fct.f90;184:184;hlo;traadv_fct_mp_tra_adv_fct_;0>
Loop at 184 -- selected for multiversion- Assume shape array stride tests
Loop at 188 -- selected for multiversion- Assume shape array stride tests
</users/home/mc28217/dev_HPC_Gyre_benchmark_test_trunk_2/NEMOGCM/CONFIG/GYRE_BENCHMARK_BLKD/BLD/ppsrc/nemo/traadv_fct.f90;190:190;hlo;traadv_fct_mp_tra_adv_fct_;0>
Loop at 190 -- selected for multiversion- Assume shape array stride tests
Based on these results from opt-report, I tried to completely disable the scalar replacement optimization, and I managed to remove loop fusion from the various loops with a compiler directive. Despite this, I cannot see any difference.
What could be the interfering optimization introduced by -O3?
Some additional information: for license reasons I cannot post the code. I have thirteen 3D loop nests and, based on the Distinct Lines Estimation analysis, I tiled the centermost loop of every nest.
EDIT: This is a loop nest example:
    DO jk = 2, jpkm1
       DO jltj = 1, jpj, OBS_UPSTRFLX_TILEY
          DO jj = jltj, MIN(jpj, jltj+OBS_UPSTRFLX_TILEY-1)
             DO ji = 1, jpi
                zfp_wk = pwn(ji,jj,jk) + ABS( pwn(ji,jj,jk) )
                zfm_wk = pwn(ji,jj,jk) - ABS( pwn(ji,jj,jk) )
                zwz(ji,jj,jk) = 0.5 * ( zfp_wk * ptb(ji,jj,jk,jn) + zfm_wk * ptb(ji,jj,jk-1,jn) ) * wmask(ji,jj,jk)
             END DO
          END DO
       END DO
    END DO
Other loop nests are more or less the same, with tiling performed on the centermost loop.
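Since I cannot post the real routine, here is a minimal self-contained sketch of the same pattern (not the actual code). The array sizes, the tile size `tiley`, the random input data, the single tracer index and the names `zwz_ref`/`zwz_blk` are all assumptions made up for illustration. It only checks that the tiled and untiled versions of the example nest above produce the same values, so it could serve as a starting point for a reproducer.

```fortran
program tiling_sketch
   implicit none
   ! All sizes below are hypothetical, picked only for illustration.
   integer, parameter :: jpi = 64, jpj = 50, jpk = 31
   integer, parameter :: jpkm1 = jpk - 1
   integer, parameter :: jn = 1        ! single tracer (assumption)
   integer, parameter :: tiley = 12    ! assumed tile size along jj
   real, allocatable :: pwn(:,:,:), wmask(:,:,:), ptb(:,:,:,:)
   real, allocatable :: zwz_ref(:,:,:), zwz_blk(:,:,:)
   real :: zfp_wk, zfm_wk, diff
   integer :: ji, jj, jk, jltj

   allocate(pwn(jpi,jpj,jpk), wmask(jpi,jpj,jpk), ptb(jpi,jpj,jpk,jn))
   allocate(zwz_ref(jpi,jpj,jpk), zwz_blk(jpi,jpj,jpk))

   ! Random inputs; pwn is shifted so that it takes both signs.
   call random_number(pwn)
   pwn = pwn - 0.5
   call random_number(ptb)
   wmask = 1.0
   zwz_ref = 0.0
   zwz_blk = 0.0

   ! Reference: untiled loop nest.
   do jk = 2, jpkm1
      do jj = 1, jpj
         do ji = 1, jpi
            zfp_wk = pwn(ji,jj,jk) + abs(pwn(ji,jj,jk))
            zfm_wk = pwn(ji,jj,jk) - abs(pwn(ji,jj,jk))
            zwz_ref(ji,jj,jk) = 0.5 * (zfp_wk * ptb(ji,jj,jk,jn) &
                 + zfm_wk * ptb(ji,jj,jk-1,jn)) * wmask(ji,jj,jk)
         end do
      end do
   end do

   ! Tiled: the jj loop is split into blocks of at most tiley rows.
   do jk = 2, jpkm1
      do jltj = 1, jpj, tiley
         do jj = jltj, min(jpj, jltj + tiley - 1)
            do ji = 1, jpi
               zfp_wk = pwn(ji,jj,jk) + abs(pwn(ji,jj,jk))
               zfm_wk = pwn(ji,jj,jk) - abs(pwn(ji,jj,jk))
               zwz_blk(ji,jj,jk) = 0.5 * (zfp_wk * ptb(ji,jj,jk,jn) &
                    + zfm_wk * ptb(ji,jj,jk-1,jn)) * wmask(ji,jj,jk)
            end do
         end do
      end do
   end do

   ! Both versions should agree up to rounding.
   diff = maxval(abs(zwz_ref - zwz_blk))
   print *, 'max abs difference:', diff
   if (diff > 1.0e-6) stop 1
end program tiling_sketch
```

With `jpj = 50` and `tiley = 12` the last tile is partial, so the `min(...)` upper bound is exercised. The arrays here are far too small to show any cache effect; for timing, the sizes would have to be raised to those of the real run.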