
I am working on a Fortran project that simulates vegetation dynamics. The code is slow, so I am always on the lookout for ways to optimize it. I have read that there is a "rule" saying that usually 90% of the time is spent in 10% of the code. To find these bottlenecks I have started using the Intel VTune performance analyzer. The analysis of a simulation shows that a large amount of time is spent in specific parts of the code (Figure 1). The most time-consuming part of leaftw_derivs is shown in Figure 2.

The code referred to in the analysis is shown below.

   !---- Update soil moisture and energy from transpiration/root uptake. ------------------!
   if (rk4aux(ibuff)%any_resolvable) then
      do k1 = klsl, mzg    ! loop over extracted water
         do k2=k1,mzg
            if (rk4site%ntext_soil(k2) /= 13) then
               !---------------------------------------------------------------------------!
               !     Transpiration happens only when there is some water left down to this !
               ! layer.                                                                    !
               !---------------------------------------------------------------------------!
               if (rk4aux(ibuff)%avail_h2o_int(k1) > 0.d0) then
                  !------------------------------------------------------------------------!
                  !    Find the contribution of layer k2 for the transpiration from        !
                  ! cohorts that reach layer k1.                                           !
                  !------------------------------------------------------------------------!
                  ext_weight = rk4aux(ibuff)%avail_h2o_lyr(k2) / rk4aux(ibuff)%avail_h2o_int(k1)

                  !------------------------------------------------------------------------!
                  wloss_tot      = 0.d0
                  qloss_tot      = 0.d0
                  wvlmeloss_tot  = 0.d0
                  qvlmeloss_tot  = 0.d0

                  do ico=1,cpatch%ncohorts
                     !----- Find the loss from this cohort. -------------------------------!
                     wloss         = rk4aux(ibuff)%extracted_water(ico,k1) * ext_weight
                     qloss         = wloss * tl2uint8(initp%soil_tempk(k2),1.d0)
                     wvlmeloss     = wloss * wdnsi8 * dslzi8(k2)
                     qvlmeloss     = qloss * dslzi8(k2)
                     !---------------------------------------------------------------------!


                     !---------------------------------------------------------------------!
                     !      Add the internal energy to the cohort.  This energy will be    !
                     ! eventually lost to the canopy air space because of transpiration,   !
                     ! but we will do it in two steps so we ensure energy is conserved.    !
                     !---------------------------------------------------------------------!
                     dinitp%leaf_energy(ico) = dinitp%leaf_energy(ico)  + qloss
                     dinitp%veg_energy(ico)  = dinitp%veg_energy(ico)   + qloss
                     initp%hflx_lrsti(ico) = initp%hflx_lrsti(ico)      + qloss
                     !---------------------------------------------------------------------!

                     !----- Integrate the total to be removed from this layer. ------------!
                     wloss_tot     = wloss_tot     + wloss
                     qloss_tot     = qloss_tot     + qloss
                     wvlmeloss_tot = wvlmeloss_tot + wvlmeloss
                     qvlmeloss_tot = qvlmeloss_tot + qvlmeloss
                     !---------------------------------------------------------------------!
                  end do
                  !------------------------------------------------------------------------!



                  !----- Update derivatives of water, energy, and transpiration. ----------!
                  dinitp%soil_water   (k2) = dinitp%soil_water(k2)    - wvlmeloss_tot
                  dinitp%soil_energy  (k2) = dinitp%soil_energy(k2)   - qvlmeloss_tot
                  dinitp%avg_transloss(k2) = dinitp%avg_transloss(k2) - wloss_tot
                  !------------------------------------------------------------------------!
               end if
               !---------------------------------------------------------------------------!
            end if
            !------------------------------------------------------------------------------!
         end do
         !---------------------------------------------------------------------------------!
      end do
      !------------------------------------------------------------------------------------!
   end if
   !---------------------------------------------------------------------------------------!

I have a very basic understanding of optimization but I don't see what could be done here to improve the code. In particular I don't understand what Instructions Retired means and how to go about it. Is there a way here to speed up computations?

EDIT

Giving it a bit more thought, I realized that there are some easy optimizations here. For example, the conditional if (rk4aux(ibuff)%avail_h2o_int(k1) > 0.d0) depends only on k1 and can be moved outside the k2 loop, and the call tl2uint8(initp%soil_tempk(k2),1.d0) depends only on k2 and can be moved outside the innermost loop.
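A minimal, standalone sketch of this kind of hoisting is shown below. It is not the model code: the array names, the sizes, the input values, and the function uint_of_temp (a made-up stand-in for tl2uint8, whose implementation is not shown in this post) are all assumptions chosen only to make the example runnable.

```fortran
program hoist_demo
   implicit none
   integer, parameter :: dp   = kind(1.d0)
   integer, parameter :: nlyr = 4      ! hypothetical number of soil layers
   integer, parameter :: nco  = 5      ! hypothetical number of cohorts
   real(dp) :: extracted(nco), tempk(nlyr), avail
   real(dp) :: e_naive(nco), e_hoist(nco), uint_k2
   integer  :: k2, ico

   !----- Made-up inputs, for illustration only. -----------------------------!
   extracted = [(1.d-3 * ico, ico = 1, nco)]
   tempk     = [(280.d0 + k2, k2 = 1, nlyr)]
   avail     = 0.5d0

   !----- Original shape: test and function call repeated for every cohort. --!
   e_naive = 0.d0
   do k2 = 1, nlyr
      do ico = 1, nco
         if (avail > 0.d0) then
            e_naive(ico) = e_naive(ico) + extracted(ico) * uint_of_temp(tempk(k2))
         end if
      end do
   end do

   !----- Hoisted shape: test once, function call once per layer. ------------!
   e_hoist = 0.d0
   if (avail > 0.d0) then
      do k2 = 1, nlyr
         uint_k2 = uint_of_temp(tempk(k2))
         do ico = 1, nco
            e_hoist(ico) = e_hoist(ico) + extracted(ico) * uint_k2
         end do
      end do
   end if

   !----- Both versions must give the same answer. ---------------------------!
   if (maxval(abs(e_naive - e_hoist)) > 1.d-12 * maxval(abs(e_naive))) then
      error stop 'hoisted and naive loops disagree'
   end if
   print '(a)', 'hoisted and naive loops agree'

contains

   !----- Placeholder formula, NOT the one used by tl2uint8 in the model. ----!
   pure function uint_of_temp(t) result(u)
      real(dp), intent(in) :: t
      real(dp)             :: u
      u = 4186.d0 * (t - 273.15d0)
   end function uint_of_temp

end program hoist_demo
```

The hoisted version calls the function nlyr times instead of nlyr*nco times. A compiler may or may not do this on its own, depending on whether it can prove the function has no side effects.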

However I cannot really understand the reason for the supposedly long times VTune gives: the 3 lines

             dinitp%leaf_energy(ico) = dinitp%leaf_energy(ico)  + qloss
             dinitp%veg_energy(ico)  = dinitp%veg_energy(ico)   + qloss
             initp%hflx_lrsti(ico) = initp%hflx_lrsti(ico)      + qloss

are just performing an addition. This should be extremely fast but instead the analyzer says that a lot of time is spent there. Why would that be?

EDIT2

I rewrote the entire loop trying to optimize as much as I could. This is the code I came up with

   !---- Update soil moisture and energy from transpiration/root uptake. ------------------!
   if (rk4aux(ibuff)%any_resolvable) then
      do k1 = klsl, mzg    ! loop over extracted water
         !---------------------------------------------------------------------------------!
         !     Transpiration happens only when there is some water left down to this       !
         ! layer.                                                                          !
         !---------------------------------------------------------------------------------!
         if (rk4aux(ibuff)%avail_h2o_int(k1) > 0.d0) then

            !----- Total water extracted by all cohorts that reach layer k1. --------------!
            wloss_tot_k1 = 0.d0
            do ico=1,cpatch%ncohorts
               wloss_tot_k1 = wloss_tot_k1 + rk4aux(ibuff)%extracted_water(ico,k1)
            end do
            !------------------------------------------------------------------------------!

            do k2=k1,mzg
               if (rk4site%ntext_soil(k2) /= 13) then
                  !----- Terms that depend only on k1 and k2 (as in the original loop). ---!
                  ext_weight = rk4aux(ibuff)%avail_h2o_lyr(k2) / rk4aux(ibuff)%avail_h2o_int(k1)
                  uint_here  = tl2uint8(initp%soil_tempk(k2),1.d0)
                  !------------------------------------------------------------------------!

                  !----- Add the internal energy to each cohort. --------------------------!
                  do ico=1,cpatch%ncohorts
                     wloss      = rk4aux(ibuff)%extracted_water(ico,k1) * ext_weight
                     uint_here1 = wloss * uint_here

                     dinitp%leaf_energy(ico) = dinitp%leaf_energy(ico) + uint_here1
                     dinitp%veg_energy(ico)  = dinitp%veg_energy(ico)  + uint_here1
                     initp%hflx_lrsti(ico)   = initp%hflx_lrsti(ico)   + uint_here1
                  end do
                  !------------------------------------------------------------------------!

                  !----- Totals removed from layer k2. ------------------------------------!
                  wloss_tot     = wloss_tot_k1 * ext_weight
                  wvlmeloss_tot = wloss_tot * dslzi8(k2) * wdnsi8
                  qvlmeloss_tot = wloss_tot * dslzi8(k2) * uint_here

                  !----- Update derivatives of water, energy, and transpiration. ----------!
                  dinitp%soil_water   (k2) = dinitp%soil_water(k2)    - wvlmeloss_tot
                  dinitp%soil_energy  (k2) = dinitp%soil_energy(k2)   - qvlmeloss_tot
                  dinitp%avg_transloss(k2) = dinitp%avg_transloss(k2) - wloss_tot
                  !------------------------------------------------------------------------!
               end if
               !---------------------------------------------------------------------------!
            end do
            !------------------------------------------------------------------------------!
         end if
         !---------------------------------------------------------------------------------!
      end do
      !------------------------------------------------------------------------------------!
   end if
   !---------------------------------------------------------------------------------------!

It's a bit long, so I don't expect people to go through it. If I run the analyzer now I get considerably reduced times (from 290 s to 185 s, although in real simulations the speed-up seems to be slightly smaller). (Figure: new times)
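The rewritten loop relies on the per-layer factors being independent of the cohort index, so the cohort sum can be taken once per k1. As a short check, using shorthand that is mine and not the model's: E_c for rk4aux(ibuff)%extracted_water(ico,k1), w for ext_weight, u for uint_here, d for dslzi8(k2), omega for wdnsi8, and N for cpatch%ncohorts:

```latex
\begin{aligned}
\texttt{wloss\_tot}     &= \sum_{c=1}^{N} E_c\, w = w \sum_{c=1}^{N} E_c = w \cdot \texttt{wloss\_tot\_k1},\\
\texttt{wvlmeloss\_tot} &= \sum_{c=1}^{N} \left(E_c\, w\right) \omega\, d = \texttt{wloss\_tot} \cdot d \cdot \omega,\\
\texttt{qvlmeloss\_tot} &= \sum_{c=1}^{N} \left(E_c\, w\, u\right) d = \texttt{wloss\_tot} \cdot d \cdot u.
\end{aligned}
```

These match the three totals computed after the cohort loop. In floating point the results can differ from the original in the last bits, because the order of the multiplications and additions has changed.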

However, when looking at the sampling there is still a considerable amount of time spent in operations that I would not expect to be "expensive". I still don't get what Instructions Retired means and how to go about it. For the moment I think this is enough, and I guess the proper way to get a further speed-up would be to use OpenMP, as Holmz suggests.


Manfredo
  • The compile option -qopt-report=4 (linux spelling) should produce a report about optimizations which would be important to understanding this. If a large amount of time is spent in those additions, it could be from waiting for the operands to become available; the purpose of VTune is to facilitate investigation as well as to identify hot spots. ifort doesn't necessarily optimize well a case like this with several sum reductions in parallel; if that is the problem, it will take some experiments to find improvement. – tim18 Aug 01 '17 at 16:16
  • Assuming that your comment means you were successful in moving the exponentiation ahead of the loop, you should show the corresponding VTune profile and compiler optimization report. – tim18 Aug 02 '17 at 01:15
  • Yes I'm still testing this portion with the modifications. I will try to post an update asap. – Manfredo Aug 02 '17 at 07:09
  • !$OMP SIMD REDUCTION clause may help the three summing lines. Possibly three loops may be better than one. Those are part of a structure/TYPE... So they are NOT linear/CONTIGUOUS and your stride is not 1, and the memory addresses are all over the place. I would try busting those fields out into separate vector/arrays, ... So basically testing only that "function" in a standalone sense using a small program to drive it. – Holmz Aug 02 '17 at 21:04
  • I am not too familiar with OpenMP so for the moment I wanted to concentrate my efforts on just plain optimization. As I specified in the Edit I already got a pretty decent improvement. I guess that taking care of the 5 or so most time consuming portions of the code could already give a 50% speed up. Only problem is that now most of the bottlenecks seem to be the mathematical functions (that I cannot pinpoint the location of). – Manfredo Aug 03 '17 at 08:57
  • Would it make any sense to convert `wloss` into an array so you could do `wloss(ico)=rk4aux(ibuff)%extracted_water(ico,k1)*ext_weight` or even `wloss(1:cpatch%ncohorts)=rk4aux(ibuff)%extracted_water(1:cpatch%ncohorts,k1)*ext_weight` and later `dinitp%leaf_energy(1:cpatch%ncohorts) = dinitp%leaf_energy(1:cpatch%ncohorts) + uint_here * wloss(1:cpatch%ncohorts)` outside the loop; i.e. trade space for speed. I'm curious if array operations and contiguous memory buy you any speed. – arclight Aug 04 '17 at 05:35
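The suggestions from Holmz and arclight above could be tried in isolation with a small driver along these lines. This is only a sketch under stated assumptions: plain contiguous arrays stand in for the derived-type components, and the sizes and values are made up. The !$omp simd directive is standard OpenMP, but it is treated as an ordinary comment unless OpenMP (or OpenMP SIMD) support is enabled at compile time.

```fortran
program array_section_demo
   implicit none
   integer, parameter :: dp  = kind(1.d0)
   integer, parameter :: nco = 6       ! hypothetical number of cohorts
   real(dp) :: extracted(nco), wloss(nco)
   real(dp) :: leaf_loop(nco), leaf_vec(nco)
   real(dp) :: ext_weight, uint_here, wloss_tot
   integer  :: ico

   !----- Made-up inputs, for illustration only. -----------------------------!
   extracted  = [(2.d-4 * ico, ico = 1, nco)]
   ext_weight = 0.25d0
   uint_here  = 1.0d5
   leaf_loop  = 1.d0
   leaf_vec   = 1.d0

   !----- Scalar form, one cohort at a time. ---------------------------------!
   do ico = 1, nco
      leaf_loop(ico) = leaf_loop(ico) + extracted(ico) * ext_weight * uint_here
   end do

   !----- Array-section form on contiguous, stride-1 arrays. -----------------!
   wloss(1:nco)    = extracted(1:nco) * ext_weight
   leaf_vec(1:nco) = leaf_vec(1:nco) + uint_here * wloss(1:nco)

   !----- Sum reduction with an OpenMP SIMD hint. ----------------------------!
   wloss_tot = 0.d0
   !$omp simd reduction(+:wloss_tot)
   do ico = 1, nco
      wloss_tot = wloss_tot + wloss(ico)
   end do

   !----- Consistency checks between the different forms. --------------------!
   if (maxval(abs(leaf_loop - leaf_vec)) > 1.d-12) then
      error stop 'array-section update disagrees with the loop'
   end if
   if (abs(wloss_tot - sum(wloss)) > 1.d-15) then
      error stop 'reduction disagrees with the intrinsic sum'
   end if
   print '(a)', 'loop, array-section and reduction forms agree'
end program array_section_demo
```

Whether any of these forms is actually faster in the real code would have to be measured, ideally together with the compiler's optimization report as tim18 suggests.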
