
Is it possible to start the loop with the index running from 1 to n-2 using a DPC++ parallel_for?

h.parallel_for(range{lx, ly}, [=](id<2> idx) { ... });

This gives a loop from 0 to lx-1 in each dimension, so I have to guard the body with

idx[0]>0 && idx[1]>0 && idx[0]<lx-1 && idx[1]<ly-1

and only then execute the loop body?

Also, does DPC++ support something like a 4D parallel_for?

Victor Eijkhout

1 Answer


In SYCL 1.2.1, parallel_for supports offsets, so you could use h.parallel_for(range{lx-2, ly-2}, id{1, 1}, [=](id<2> idx){ ... });.

However, this overload has been deprecated in SYCL 2020:

Offsets to parallel_for, nd_range, nd_item and item classes have been deprecated. As such, the parallel iteration spaces all begin at (0,0,0) and developers are now required to handle any offset arithmetic themselves. The behavior of nd_item.get_global_linear_id() and nd_item.get_local_linear_id() has been clarified accordingly.

So, if you want to conform to the latest standard, you should apply the offset manually:

h.parallel_for(range{lx-2, ly-2}, [=](id<2> idx0) { id<2> idx = idx0 + 1; ... });
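Expanded into a minimal, self-contained sketch (the `data` array, its layout, and the concrete values of `lx` and `ly` are assumptions for illustration, not part of the original question):

```cpp
#include <sycl/sycl.hpp>

int main() {
  constexpr size_t lx = 8, ly = 8;  // hypothetical grid dimensions
  sycl::queue q;
  float *data = sycl::malloc_shared<float>(lx * ly, q);

  q.submit([&](sycl::handler &h) {
    // Launch only the (lx-2) x (ly-2) interior points and shift by (1,1),
    // so idx covers 1..lx-2 and 1..ly-2.
    h.parallel_for(sycl::range<2>{lx - 2, ly - 2}, [=](sycl::id<2> idx0) {
      sycl::id<2> idx = idx0 + sycl::id<2>{1, 1};
      data[idx[0] * ly + idx[1]] = 1.0f;  // touches interior points only
    });
  }).wait();

  sycl::free(data, q);
}
```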

That said, depending on your data layout, your original approach of having "empty" threads might be faster.
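For comparison, a sketch of the "empty threads" variant under the same assumptions (same hypothetical `data`, `lx`, `ly`): the full range is launched and boundary work-items simply do nothing.

```cpp
q.submit([&](sycl::handler &h) {
  h.parallel_for(sycl::range<2>{lx, ly}, [=](sycl::id<2> idx) {
    // Boundary work-items fail the check and return immediately.
    if (idx[0] > 0 && idx[1] > 0 && idx[0] < lx - 1 && idx[1] < ly - 1) {
      data[idx[0] * ly + idx[1]] = 1.0f;  // interior work only
    }
  });
}).wait();
```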

Also, does DPC++ support something like a 4D parallel_for?

No. You will have to use a 1D range and compute the 4D indices manually.
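A hedged sketch of that decomposition, assuming an existing sycl::queue `q` and hypothetical extents `n0`..`n3`; the flat index is split with div/mod, with `n3` varying fastest:

```cpp
const size_t n0 = 4, n1 = 4, n2 = 8, n3 = 8;  // hypothetical extents
q.submit([&](sycl::handler &h) {
  h.parallel_for(sycl::range<1>{n0 * n1 * n2 * n3}, [=](sycl::id<1> i) {
    size_t rem = i[0];
    const size_t i3 = rem % n3; rem /= n3;  // fastest-varying index
    const size_t i2 = rem % n2; rem /= n2;
    const size_t i1 = rem % n1; rem /= n1;
    const size_t i0 = rem;                  // slowest-varying index
    // ... use (i0, i1, i2, i3) here ...
    (void)i0; (void)i1; (void)i2; (void)i3;
  });
});
```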

aland
  • Thank you so much for your reply; may I ask one more question? What is the meaning of nd_range? I understand that nd_range divides the global array into local ones and does the calculations separately, but what does that mean? – Mac cchiatooo Apr 27 '22 at 15:44
  • For most GPU hardware, the work-items (aka "threads" in CUDA) are organized into work-groups (aka "blocks"). Work-items in the same work-group can communicate efficiently with each other (local memory, barriers etc). `sycl::nd_range` allows you to explicitly specify the size of each work-group. If you don't need this in your kernel, then `sycl::range` is just fine. – aland Apr 27 '22 at 16:38
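To illustrate the comment above, a minimal sketch (again assuming a sycl::queue `q`) of a 2D `sycl::nd_range` with explicit 8x8 work-groups; note the global size must be divisible by the local size in each dimension:

```cpp
q.submit([&](sycl::handler &h) {
  h.parallel_for(
      sycl::nd_range<2>{sycl::range<2>{64, 64},  // global size
                        sycl::range<2>{8, 8}},   // work-group (local) size
      [=](sycl::nd_item<2> it) {
        const size_t gx  = it.get_global_id(0);  // position in the whole grid
        const size_t lid = it.get_local_id(0);   // position within the work-group
        // Work-items in the same work-group can synchronize:
        sycl::group_barrier(it.get_group());
        (void)gx; (void)lid;
      });
});
```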