
I am doing matrix multiplication in SYCL and have working code in which I used only range in parallel_for instead of nd_range. Now I want to use barriers, and as far as I have read, barriers can only be used with nd_range, right? I am attaching a portion of my code; kindly tell me if this can be done without nd_range, or what changes I should make to use nd_range. Thanks.

queue.submit([&](cl::sycl::handler &cgh) {
    auto A = A_sycl.get_access<cl::sycl::access::mode::read>(cgh);
    auto B = B_sycl.get_access<cl::sycl::access::mode::read>(cgh);
    auto C = C_sycl.get_access<cl::sycl::access::mode::write>(cgh);

    cgh.parallel_for<class test>(
        cl::sycl::range<2>(4, 4), [=](cl::sycl::id<2> id) {
          C[id] = A[id] * B[id];
        });

});

1 Answer


Using nd_range lets you specify your local range explicitly. To place a work-group barrier in your kernel, you also need to use nd_item instead of id: it gives you access to additional indices and sizes, such as the global and local id, the group range and the local range, as well as the barrier synchronisation primitive.
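For reference, these are the nd_item queries referred to above, shown as a minimal sketch of a kernel body (the member functions are from the SYCL 1.2.1 cl::sycl::nd_item interface; the variable names are just for illustration):

[=](cl::sycl::nd_item<2> item) {
  const auto gid        = item.get_global_id(0);    // global id in dimension 0
  const auto lid        = item.get_local_id(0);     // id within the work-group
  const auto num_groups = item.get_group_range(0);  // number of work-groups
  const auto wg_size    = item.get_local_range(0);  // work-group (local) size
  // work-group barrier over local memory
  item.barrier(cl::sycl::access::fence_space::local_space);
}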

You can then place a barrier upon finishing reading/writing to device local memory (using a device-only local accessor).

The range and id pair, on the other hand, does not give you any of that functionality. It is only intended to simplify the command group setup and the writing of global-memory kernels, where you want the runtime to decide on work-group sizes for you and to have a simple way to index your work-items, as opposed to the traditional OpenCL approach where you always have to define an NDRange (nd_range in SYCL) explicitly, no matter how simple or complex your kernels are.

Here's a simple example, assuming you want to launch a 2D kernel.

myQueue.submit([&](cl::sycl::handler& cgh) {
    auto A_ptr = A_buf.get_access<cl::sycl::access::mode::read>(cgh);
    auto B_ptr = B_buf.get_access<cl::sycl::access::mode::read>(cgh);
    auto C_ptr = C_buf.get_access<cl::sycl::access::mode::write>(cgh);
    // scratch/local memory for faster memory access while computing the results
    cl::sycl::accessor<int, 1, cl::sycl::access::mode::read_write,
                       cl::sycl::access::target::local>
        C_scratch(cl::sycl::range<1>{size}, cgh);

    cgh.parallel_for<class example_kernel>(
        cl::sycl::nd_range<2>(
            cl::sycl::range<2>{size >> 3, size >> 3},   // global range: 8, 8
            cl::sycl::range<2>{size >> 4, size >> 4}),  // local range:  4, 4
        [=](cl::sycl::nd_item<2> item) {
          // get the 2D x and y indices
          const auto id_x = item.get_global_id(0);
          const auto id_y = item.get_global_id(1);
          // width of the global range in the first dimension
          const auto width =
              item.get_group_range(0) * item.get_local_range(0);
          // map the 2D x and y indices to a single linear, 1D index
          const auto index = id_x * width + id_y;
          // compute A_ptr * B_ptr into C_scratch
          C_scratch[index] = A_ptr[index] * B_ptr[index];
          // wait for the result to be written (sync local memory read/write)
          item.barrier(cl::sycl::access::fence_space::local_space);
          // output the result computed in local memory
          C_ptr[index] = C_scratch[index];
        });
  });

I am using a 1D representation of both the host data and the SYCL buffers, which explains the mapping from 2D indices to a single linear, 1D index.
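For context, a matching host-side setup could look like the sketch below. The value of size, the std::vector names, and the buffer construction are assumptions for illustration and not part of the original code; the command group is the one shown above.

#include <CL/sycl.hpp>
#include <vector>

int main() {
  constexpr size_t size = 64;  // 8 x 8 global range, stored as a 1D array

  std::vector<int> A_host(size, 2), B_host(size, 3), C_host(size, 0);

  {
    // buffers wrap the 1D host data for the duration of this scope
    cl::sycl::buffer<int, 1> A_buf(A_host.data(), cl::sycl::range<1>{size});
    cl::sycl::buffer<int, 1> B_buf(B_host.data(), cl::sycl::range<1>{size});
    cl::sycl::buffer<int, 1> C_buf(C_host.data(), cl::sycl::range<1>{size});

    cl::sycl::queue myQueue;
    // ... the myQueue.submit(...) command group shown above goes here ...
  }  // buffers go out of scope here and the results are copied back into C_host

  return 0;
}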

I hope this explanation helps to apply those concepts in your case.

  • Thanks for your detailed explanation, it was very helpful; I will try to implement it. Thanks again – ZSA Oct 21 '19 at 11:11