I'm fairly new to FPGAs and I'm trying to port some code written for GPUs to FPGAs in order to compare performance.

From my understanding, using parallel_for isn't good practice on FPGAs (in fact, it runs very slowly); instead, I think I should use a single_task with an unrolled for loop. I'm struggling to make it work properly, though.

So, I have

q.submit([&](sycl::handler &h){
   h.parallel_for<class Foo>(sycl::nd_range<1>(n_blocks * n_threads, n_threads),
          [=](sycl::nd_item<1> it) {
              some_kernel(it, <other params here ...> );
          });
}).wait();

and my attempt is

q.submit([&](sycl::handler &h){
   h.single_task<class Foo>([=]() {
     // single_task takes a callable, so the loop has to live inside a lambda
     #pragma unroll
     for(int i = 0; i < n_blocks * n_threads; ++i)
        some_kernel(...);
   });
}).wait();

But I'm not sure how to adapt what I was previously doing with the sycl::nd_item. For instance, how can I use the loop index to replace the calls to get_group and get_local_id?
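To make the question concrete, here is a minimal sketch (plain C++, no SYCL) of the index mapping I have in mind. `FlatIndex`, `decompose` and `count_visited` are hypothetical names I made up for illustration; they are not part of SYCL. It assumes a 1-D range and that some_kernel only needs the indices, i.e. it does not rely on barriers or local memory, which a sequential loop would not reproduce.

```cpp
#include <cstddef>

// Hypothetical stand-in for the indices a 1-D nd_item would provide.
struct FlatIndex {
    std::size_t group;   // would replace it.get_group(0)
    std::size_t local;   // would replace it.get_local_id(0)
    std::size_t global;  // would replace it.get_global_id(0)
};

// Split a flat loop index i into (group, local), given the work-group size.
constexpr FlatIndex decompose(std::size_t i, std::size_t n_threads) {
    return FlatIndex{i / n_threads, i % n_threads, i};
}

// Walk the whole range the way the single_task loop would, and count how
// many indices satisfy group * n_threads + local == global.
inline std::size_t count_visited(std::size_t n_blocks, std::size_t n_threads) {
    std::size_t consistent = 0;
    for (std::size_t i = 0; i < n_blocks * n_threads; ++i) {
        const FlatIndex idx = decompose(i, n_threads);
        // some_kernel(idx, ...) would be called here (hypothetical signature)
        if (idx.group * n_threads + idx.local == idx.global) {
            ++consistent;
        }
    }
    return consistent;
}
```

With n_threads = 8, for example, loop index 10 would map to group 1 and local id 2. Whether this is the right way to restructure the kernel for an FPGA is exactly what I'm unsure about.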

Should I change the design of the kernel entirely? In other words, is the work-group / work-group-size approach not appropriate for FPGAs?

Elle