
I wrote an OpenCL code to solve the advection equation using two different schemes: the upstream bias scheme and the leapfrog scheme.

The code runs fine, but I would like to know whether I could use OpenCL local memory to optimize it. From my understanding, local memory is useful when there is something that can be shared within the local workgroup. But in my OpenCL kernels, other than the indices, I do not have (or cannot think of) anything that should or could be shared.

Kernel for upstream bias scheme

kernel void upstream3d(
                        const  int nx,
                        const  int ny,
                        const  int nz,
                        global float *in_p_tf,
                        global float *in_p_tn,
                        const  float u_vel,
                        const  float v_vel,
                        const  float w_vel,
                        const  float C
                      )
{
   int i    =  get_global_id(0);
   int j    =  get_global_id(1);
   int k    =  get_global_id(2);

   int idx, idx_i, idx_j, idx_k;

   // periodic indices: c_* = current cell, m_* = upwind (x-1, y-1, z-1) neighbors
   int c_xi = i % nx,
       c_yi = j % ny,
       c_zi = k % nz,
       m_xi = (i+nx-1)%nx,
       m_yi = (j+ny-1)%ny,
       m_zi = (k+nz-1)%nz;


   // linear indices of the current cell and its three upwind neighbors
   idx      =  c_xi + c_yi * nx + c_zi * nx * ny;

   idx_i    =  m_xi + c_yi * nx + c_zi * nx * ny;
   idx_j    =  c_xi + m_yi * nx + c_zi * nx * ny;
   idx_k    =  c_xi + c_yi * nx + m_zi * nx * ny;

   // upstream-biased (upwind) update of the new time level
   in_p_tf[idx]  = in_p_tn[idx]
                 - u_vel * C * (in_p_tn[idx] - in_p_tn[idx_i])
                 - v_vel * C * (in_p_tn[idx] - in_p_tn[idx_j])
                 - w_vel * C * (in_p_tn[idx] - in_p_tn[idx_k]);

}

Kernel for Leapfrog scheme

kernel void leapfrog3d(
                        const  int nx,
                        const  int ny,
                        const  int nz,
                        global float *in_p_tf,
                        global float *in_p_tn,
                        global float *in_p_tp,
                        const  float u_vel,
                        const  float v_vel,
                        const  float w_vel,
                        const  float C
                      )
{
   int i    =  get_global_id(0);
   int j    =  get_global_id(1);
   int k    =  get_global_id(2);

   int idx0, idx_i0, idx_i1, idx_j0, idx_j1, idx_k0, idx_k1;

   // periodic indices: p_* = downwind (+1), c_* = current cell, m_* = upwind (-1)
   int p_xi = (i+1)%nx,
       p_yi = (j+1)%ny,
       p_zi = (k+1)%nz,
       c_xi = i % nx,
       c_yi = j % ny,
       c_zi = k % nz,
       m_xi = (i+nx-1)%nx,
       m_yi = (j+ny-1)%ny,
       m_zi = (k+nz-1)%nz;


   idx0     =  c_xi + c_yi * nx + c_zi * nx * ny;

   idx_i0   =  p_xi + c_yi * nx + c_zi * nx * ny;
   idx_j0   =  c_xi + p_yi * nx + c_zi * nx * ny;
   idx_k0   =  c_xi + c_yi * nx + p_zi * nx * ny;

   idx_i1   =  m_xi + c_yi * nx + c_zi * nx * ny;
   idx_j1   =  c_xi + m_yi * nx + c_zi * nx * ny;
   idx_k1   =  c_xi + c_yi * nx + m_zi * nx * ny;

   // leapfrog update: centered differences in space and time
   in_p_tf[idx0] = in_p_tp[idx0]
                 - u_vel * C * (in_p_tn[idx_i0] - in_p_tn[idx_i1])
                 - v_vel * C * (in_p_tn[idx_j0] - in_p_tn[idx_j1])
                 - w_vel * C * (in_p_tn[idx_k0] - in_p_tn[idx_k1]);

   // time filter on the middle time level to damp the leapfrog computational mode
   // (0.80f / 2.0f: float literals avoid implicit double-precision arithmetic)
   in_p_tn[idx0] = in_p_tn[idx0]
                 + 0.80f * (in_p_tf[idx0] - 2.0f * in_p_tn[idx0] + in_p_tp[idx0]);

}

Is this all I can get from OpenCL, or am I missing something?

Thanks.

Redshoe

1 Answer


Shared/local memory optimization is sometimes very useful (~10x gain in matrix multiplication), and sometimes it is not. In your case it might be somewhat useful, but not much; you have to try. I have tested a similar application in lattice Boltzmann, but without any performance gains (local memory could only cover 8/171 byte of the memory transfer there, so I expected only insignificant gains in the first place).

Out of the 3 advection directions, you can use shared memory for only one, in your case the x-direction. In the `upstream3d` kernel each cell reads 4 floats and writes 1 (20 bytes of memory transfer), and only the 4-byte x-1 read can be served from local memory; that is 4/20 byte, so expect at best ~20% better performance.

A workgroup is a strip of grid cells along the x-direction. Each grid cell loads `in_p_tn` for itself and from 3 of its neighbors at one index lower. The y-1 and z-1 neighbors are not located in the workgroup, so you can't use local memory for those. But the x-1 neighbors are, at least all but the leftmost one. So the strategy would be:

  1. Load `in_p_tn` for each cell in the workgroup from global memory into private memory and then write it to a local memory array at the local ID. The leftmost thread must additionally load its left x-1 neighbor, since that lies outside of the workgroup strip*. The local memory array must have size workgroup size + 1.
  2. Issue a `barrier(CLK_LOCAL_MEM_FENCE);`. After the barrier, the local array has been filled.
  3. Take `in_p_tn` for the current cell from private memory. For the x-1 neighbor, read it from local memory. Do the advection and write the result to global memory (see the sketch after this list). Done!

*This can potentially eliminate any performance gain if you make the workgroup too small.
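
Put together, a minimal sketch of what these three steps could look like for the upstream3d kernel, assuming a 1D local size of (WG, 1, 1), nx a multiple of WG, and a global size of (nx, ny, nz); the kernel name upstream3d_local, the constant WG and the names p_c/cache are only illustrative:

#define WG 64   // assumed workgroup size along x; nx must be a multiple of WG

kernel void upstream3d_local(
                        const  int nx,
                        const  int ny,
                        const  int nz,
                        global float *in_p_tf,
                        global float *in_p_tn,
                        const  float u_vel,
                        const  float v_vel,
                        const  float w_vel,
                        const  float C
                      )
{
   int i   = get_global_id(0);
   int j   = get_global_id(1);
   int k   = get_global_id(2);
   int lid = get_local_id(0);               // position within the x-strip

   int c_xi = i % nx,
       c_yi = j % ny,
       c_zi = k % nz,
       m_xi = (i+nx-1)%nx,
       m_yi = (j+ny-1)%ny,
       m_zi = (k+nz-1)%nz;

   int idx   = c_xi + c_yi * nx + c_zi * nx * ny;
   int idx_i = m_xi + c_yi * nx + c_zi * nx * ny;  // x-1 neighbor, needed by lid==0 only
   int idx_j = c_xi + m_yi * nx + c_zi * nx * ny;  // y-1 neighbor, stays in global memory
   int idx_k = c_xi + c_yi * nx + m_zi * nx * ny;  // z-1 neighbor, stays in global memory

   // step 1: stage the x-strip in local memory, with one extra slot for the left halo
   local float cache[WG + 1];
   float p_c = in_p_tn[idx];                // own value, loaded once into private memory
   cache[lid + 1] = p_c;
   if(lid == 0) cache[0] = in_p_tn[idx_i];  // leftmost thread also loads the x-1 halo cell

   // step 2: wait until the whole workgroup has filled the local array
   barrier(CLK_LOCAL_MEM_FENCE);

   // step 3: x-1 comes from local memory, y-1 and z-1 still come from global memory
   in_p_tf[idx] = p_c
                - u_vel * C * (p_c - cache[lid])
                - v_vel * C * (p_c - in_p_tn[idx_j])
                - w_vel * C * (p_c - in_p_tn[idx_k]);
}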

Another remark: in the `upstream3d` kernel, you currently load `in_p_tn[idx]` from global to private memory 4 times. Make a private variable, load it once, and then use the private variable. Never trust the compiler in this regard.
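
For example, a minimal sketch of that change, reusing the indices the kernel already computes (the name p_c is only illustrative):

   float p_c = in_p_tn[idx];    // read the center value from global memory once

   in_p_tf[idx]  = p_c
                 - u_vel * C * (p_c - in_p_tn[idx_i])
                 - v_vel * C * (p_c - in_p_tn[idx_j])
                 - w_vel * C * (p_c - in_p_tn[idx_k]);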

ProjectPhysX
  • 4,535
  • 2
  • 14
  • 34
  • I can't visualize how the 3D `in_p_tn` is loaded onto the workgroup (especially how it would only load the i-th and (i-1)-th values). From my understanding, a workgroup 'could' be in 3 dimensions. Then for 3D advection (just like in this diagram: https://rocmdocs.amd.com/en/latest/_images/2.31.png), can't we use shared memory for all three directions? – Redshoe Dec 13 '21 at 04:27
  • The linear indexing scheme `idx = c_xi + c_yi * nx + c_zi * nx * ny;` makes the workgroup aligned with the x-axis. If `workgroup size > nx`, a workgroup spans across 2 cells in y-direction also. But then, the local memory optimization would not work. I forgot to mention: `nx` must be a multiple of `workgroup size`. Also in theory, you could construct a workgroup indexing such that a workgroup covers a 4x4x4 cube for example; then you could make the local memory optimization a bit more efficient in all 3 directions. But it's likely not worth the added complexity and reduced maintainability. – ProjectPhysX Dec 13 '21 at 09:17
  • Ahhh. Thanks. If I were to optimize for all directions (`x`, `y` and `z`), how do I implement the indexing? I mean if I implement it and it works well, I wouldn't really have to maintain it, right? – Redshoe Dec 17 '21 at 19:02
  • Use standard 3D linear indexing for the work group ID and for the thread ID within the work group separately. Then you have a 3D grid of cubic work groups each containing a 3D grid of grid cells. You would have to handle the 3x16 advection neighbors coming in from the neighboring thread blocks explicitly and this is quite a mess in the code. That would elongate the code probably by at least 20x and I don't think that this is worth the small possible performance gain. – ProjectPhysX Dec 22 '21 at 07:05
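
For reference, a minimal sketch of the 3D workgroup indexing described in the last comment, assuming cubic workgroups (the kernel name block_index_demo and the variable names are only illustrative; the halo exchange that the comment warns about is not shown):

kernel void block_index_demo(const int nx, const int ny, const int nz, global float *field)
{
   // 3D index of the workgroup within the grid of workgroups
   const int gx = get_group_id(0), gy = get_group_id(1), gz = get_group_id(2);
   // 3D index of the thread within its (e.g. 4x4x4) workgroup
   const int lx = get_local_id(0), ly = get_local_id(1), lz = get_local_id(2);
   const int bs = get_local_size(0);     // cubic block edge length, e.g. 4
   // global cell coordinates and the usual linear index
   const int x = gx * bs + lx, y = gy * bs + ly, z = gz * bs + lz;
   const int idx = x + y * nx + z * nx * ny;
   // a local tile holding the block plus a 1-cell upwind halo would be
   // local float tile[(4+1)*(4+1)*(4+1)]; filling the halo from the 3
   // upwind neighboring workgroups (3x16 cells per 4x4x4 block) is what
   // makes the full kernel long.
   field[idx] = field[idx];              // placeholder for the actual advection update
}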