In SYCL 1.2.1, parallel_for supports offsets, so you could use h.parallel_for(range{lx-2, ly-2}, id{1, 1}, [=](id<2> idx){ ... });.
However, this overload has been deprecated in SYCL 2020:
"Offsets to parallel_for, nd_range, nd_item and item classes have been deprecated. As such, the parallel iteration spaces all begin at (0,0,0) and developers are now required to handle any offset arithmetic themselves. The behavior of nd_item.get_global_linear_id() and nd_item.get_local_linear_id() has been clarified accordingly."
So, if you want to conform to the latest standard, you should apply the offset manually:
h.parallel_for(range{lx-2, ly-2}, [=](id<2> idx0) { id<2> idx = idx0 + 1; ... });
That said, depending on your data layout, your original approach of having "empty" threads might be faster.
Also, does dpc++ support like 4D parallel_for?
No. SYCL ranges support only 1, 2, or 3 dimensions, so you will have to use a 1D range and compute the 4D index manually.
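As a rough sketch of that manual index computation, the helpers below map between a linear index and a 4D index. The names delinearize4 and linearize4 are hypothetical (they are not part of SYCL or DPC++), and the layout is assumed to be row-major, with the last dimension varying fastest.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of SYCL: split a linear index into a 4D index
// for an iteration space of extents dims[0] x dims[1] x dims[2] x dims[3].
// Assumes row-major order, i.e. the last dimension varies fastest.
inline std::array<std::size_t, 4> delinearize4(
    std::size_t linear, const std::array<std::size_t, 4>& dims) {
    // The linear index must lie inside the flattened iteration space.
    assert(linear < dims[0] * dims[1] * dims[2] * dims[3]);
    std::array<std::size_t, 4> idx{};
    // Peel off one dimension at a time, starting with the fastest-varying one.
    for (int d = 3; d >= 0; --d) {
        idx[d] = linear % dims[d];
        linear /= dims[d];
    }
    return idx;
}

// Hypothetical inverse: combine a 4D index back into a linear index.
inline std::size_t linearize4(const std::array<std::size_t, 4>& idx,
                              const std::array<std::size_t, 4>& dims) {
    return ((idx[0] * dims[1] + idx[1]) * dims[2] + idx[2]) * dims[3] + idx[3];
}
```

In a kernel you would then launch over range<1>{n0 * n1 * n2 * n3} and call something like delinearize4(i[0], dims) on the id<1> argument, assuming your compiler accepts std::array in device code. Any offset, such as the +1 used above, can be added to the resulting components afterwards.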