Using nd_range lets you specify your local range explicitly. To place a work-group barrier in your kernel, you also need to use nd_item instead of id: nd_item gives you access to additional indexing queries, such as the global and local id, the group range and the local range, as well as the barrier synchronisation primitive.
You can then place a barrier after reading from or writing to device local memory (via a local accessor, which targets device-only local memory).
The use of range and id, on the other hand, gives you none of that functionality. It is only intended to simplify the command group setup and the writing of global-memory kernels, where you want the runtime to decide on work-group sizes for you and need a simple way to index your work-items. This contrasts with the traditional OpenCL approach, where you always have to define an NDRange (nd_range in SYCL) explicitly, no matter how simple or complex your kernels are.
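For contrast, the simple range/id form looks roughly like the sketch below (kernel and buffer names are placeholders, assuming the same 1D data layout; the runtime picks the work-group size, and no barrier or local accessor is available inside the kernel):

```cpp
myQueue.submit([&](cl::sycl::handler& cgh) {
  auto A_ptr = A_buf.get_access<cl::sycl::access::mode::read>(cgh);
  auto C_ptr = C_buf.get_access<cl::sycl::access::mode::write>(cgh);
  // a plain global range: one work-item per element, indexed by id<1>
  cgh.parallel_for<class simple_kernel>(
      cl::sycl::range<1>{size}, [=](cl::sycl::id<1> idx) {
        C_ptr[idx] = A_ptr[idx] * 2;
      });
});
```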
Here's a simple example, assuming you want to launch a 2D kernel.
myQueue.submit([&](cl::sycl::handler& cgh) {
  auto A_ptr = A_buf.get_access<cl::sycl::access::mode::read>(cgh);
  auto B_ptr = B_buf.get_access<cl::sycl::access::mode::read_write>(cgh);
  auto C_ptr = C_buf.get_access<cl::sycl::access::mode::write>(cgh);
  // scratch/local memory for faster memory access while computing the results
  cl::sycl::accessor<int, 1, cl::sycl::access::mode::read_write,
                     cl::sycl::access::target::local>
      C_scratch(cl::sycl::range<1>{size}, cgh);
  cgh.parallel_for<example_kernel>(
      cl::sycl::nd_range<2>(
          cl::sycl::range<2>{size >> 3, size >> 3},  // global range, e.g. 8 x 8
          cl::sycl::range<2>{size >> 4, size >> 4}), // local range, e.g. 4 x 4
      [=](cl::sycl::nd_item<2> item) {
        // get the 2D x and y indices
        const auto id_x = item.get_global_id(0);
        const auto id_y = item.get_global_id(1);
        // compute the width of the global range in dimension 0
        const auto width =
            item.get_group_range(0) * item.get_local_range(0);
        // map the 2D x and y indices to a single linear,
        // 1D (global) index
        const auto index = id_x * width + id_y;
        // compute A_ptr * B_ptr into C_scratch
        C_scratch[index] = A_ptr[index] * B_ptr[index];
        // wait for the result to be written (synchronise local memory)
        item.barrier(cl::sycl::access::fence_space::local_space);
        // output the result computed in local memory
        C_ptr[index] = C_scratch[index];
      });
});
I am using a 1D representation of both the host data and the SYCL buffers, which explains the mapping from 2D indices to a single linear, 1D index.
I hope this explanation helps you apply these concepts to your case.