
I am interested in speeding up the computation time of the subroutine compoundret, which compounds a monthly return series over some holding period, say one month, three months, six months, etc. I will be calling this subroutine from R as a DLL. I have written a main program in the code snippet below to get everything working in Fortran.

subroutine compoundret(R_c, R, RF, horizons, Tn, N, M)
  implicit none

  ! Arguments declarations
  integer, intent(in)  :: Tn, N, M
  integer, intent(in)  :: horizons(M)
  real*8,  intent(in)  :: RF(Tn), R(Tn, N, M)
  real*8,  intent(out)  :: R_c(Tn, N, M)

  ! Intermediary Variables
  integer :: t, j, k
  real*8  :: RF_Temp(Tn, N, M)

  R_c = 0.0
  do t = 1, Tn
     RF_Temp(t,:,:) = RF(t)
  end do

  !$acc data copyin(R(1:Tn,1:N,1:M), RF_Temp(1:Tn,1:N,1:M), horizons(1:M)) create(R_c(1:Tn,1:N,1:M))
  !$acc parallel loop
  do k = 1, M
    do j = 1, N
      do t = 1, Tn - horizons(k) + 1
        R_c( t, j, k) = PRODUCT( 1 + R( t:t + horizons(k) - 1, j, k) + &
                        RF_Temp( t:t + horizons(k) - 1, j, k)) - &
                        PRODUCT(1+ RF_Temp( t:t + horizons(k) - 1, j, k))
      end do
    end do
  end do
  !$acc end parallel loop
  !$acc update host(R_c)
  !$acc end data

end subroutine compoundret

Program main
    implicit none
    real*8  :: df(1000,5000, 6)
    real*8  :: retdata(size(df,1),size(df,2),size(df,3)),RF(size(df,1))
    integer :: horizons(6), Tn, N, M

    Tn = size(df, 1)
    N  = size(df, 2)
    M  = size(df, 3)

    df = 0.001
    RF = 0.001
    horizons(:) = (/1,3,6,12,24,48/)

    call compoundret(retdata,df,RF,horizons, Tn, N, M)
    print *, retdata(1, 1, 1:M)

end program

My target platform is a compute 6.0 device (GTX 1060).

Baba Yara
    This is not CUDA but OpenACC – Chiel Dec 25 '17 at 08:28
  • @Chiel It is OpenACC code, but I wouldn't mind totally offloading everything inside the loop to the GPU via a CUDA Fortran kernel. – Baba Yara Dec 25 '17 at 08:47
  • Check how long the allocation of the temporary array takes and how long the memory transfers to the GPU take. – Vladimir F Героям слава Dec 25 '17 at 09:03
  • @VladimirF Is this allowed in Fortran? In C/C++ this notation would imply a stack allocation and would most likely result in a stack overflow, since the OP probably wouldn't want to optimize this code if the arrays weren't large. – Chiel Dec 25 '17 at 11:10
  • The compiler can allocate in any way it wants. Stack allocation is not implied. It is 228 MB in this program. – Vladimir F Героям слава Dec 25 '17 at 11:59
  • Why RF_Temp? Your product expressions are all 1D arrays, so why introduce the huge 3D temporary? – Ian Bush Dec 25 '17 at 12:23
  • And where do you set RF? – Ian Bush Dec 25 '17 at 13:15
  • It seems like you are defining that big temporary array just for the convenience of doing array addition. Here you might be better off using an explicit loop for the sum (the argument to `product`). It also might be better to make `t` the outer loop; see the sketch after this comment thread. – agentp Dec 25 '17 at 18:22
  • @VladimirF It spends about one third of the time copying data to and from the GPU. I can reduce that by doing asynchronous transfer and compute, but that won't get me much of a speedup. I will have to actually check that first, though. – Baba Yara Dec 25 '17 at 20:08
  • @IanBush I use RF_Temp to generate an array with the same shape as R. I was initially getting a "data partially on gpu" error when I was just using the one-dimensional version of the array. And I have just updated the code to correctly set RF. The 1D version of RF works fine on the CPU, but I cannot seem to offload it onto the GPU for some reason. – Baba Yara Dec 25 '17 at 20:17
  • @agentp I kept t as the innermost loop because I thought there would be some memory efficiency, with Fortran reading down the columns. But I will try out your tweaks and see how much they help. Thanks for the suggestions. – Baba Yara Dec 25 '17 at 20:20
  • Yes, changing the loop order might require changing the array index order. – agentp Dec 25 '17 at 22:27
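
For reference, here is a minimal sketch of the rework the last few comments are describing: drop the 3D RF_Temp temporary, index the 1D RF directly on the device, and replace the PRODUCT over array sections with an explicit accumulation loop. The subroutine name and the exact directive placement below are illustrative assumptions, not tested code from this thread.

subroutine compoundret_scalar(R_c, R, RF, horizons, Tn, N, M)
  implicit none

  integer, intent(in)  :: Tn, N, M
  integer, intent(in)  :: horizons(M)
  real*8,  intent(in)  :: RF(Tn), R(Tn, N, M)
  real*8,  intent(out) :: R_c(Tn, N, M)

  integer :: t, j, k, h
  real*8  :: prod_r, prod_rf

  R_c = 0.0d0

  ! Copy RF as a 1D array; use copy (rather than create) for R_c so the
  ! zero-initialized elements beyond each valid window survive the
  ! transfer back to the host.
  !$acc data copyin(R(1:Tn,1:N,1:M), RF(1:Tn), horizons(1:M)) copy(R_c(1:Tn,1:N,1:M))
  !$acc parallel loop collapse(2)
  do k = 1, M
    do j = 1, N
      !$acc loop vector private(prod_r, prod_rf)
      do t = 1, Tn - horizons(k) + 1
        ! Accumulate both products in scalars instead of building
        ! array-section temporaries for PRODUCT.
        prod_r  = 1.0d0
        prod_rf = 1.0d0
        do h = 0, horizons(k) - 1
          prod_r  = prod_r  * (1.0d0 + R(t + h, j, k) + RF(t + h))
          prod_rf = prod_rf * (1.0d0 + RF(t + h))
        end do
        R_c(t, j, k) = prod_r - prod_rf
      end do
    end do
  end do
  !$acc end parallel loop
  !$acc end data

end subroutine compoundret_scalar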

1 Answer


I'd recommend collapsing the two outer loops and then adding "!$acc loop vector" on the "t" loop.

  !$acc parallel loop collapse(2)
  do k = 1, M
    do j = 1, N
  !$acc loop vector
      do t = 1, Tn - horizons(k) + 1
        R_c( t, j, k) = PRODUCT( 1 + R( t:t + horizons(k) - 1, j, k) + &
                        RF_Temp( t:t + horizons(k) - 1, j, k)) - &
                        PRODUCT(1+ RF_Temp( t:t + horizons(k) - 1, j, k))
      end do
    end do
  end do
  !$acc end parallel loop

Right now, you're only parallelizing the outer loop, and since "M" is quite small, you're underutilizing the GPU.

Note that the PGI 2017 compilers have a bug which prevents you from using OpenACC within a DLL (shared objects on Linux are fine). We're working on fixing this issue in the 18.1 compilers. Your current options are to either wait until 18.1 is released early next year or go back to the 16.10 compilers. If you're using the PGI Community Edition, you'll need to wait for the 18.4 compilers in April.

Also, putting OpenACC in shared or dynamic libraries requires the use of the "-ta=tesla:nordc" option.
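
For example, a Linux build line along these lines should work once you're on a compiler version without the issue above (the source and library names here are just placeholders; cc60 targets your compute 6.0 device):

pgfortran -acc -ta=tesla:cc60,nordc -fpic -shared -o compoundret.so compoundret.f90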

Mat Colgrove