I'm fairly new to FPGAs and I'm trying to port some code written for GPUs to an FPGA to compare performance.
From my understanding, using parallel_for isn't good practice on FPGAs (in fact, it runs very slowly); instead, I think I should use a single_task with an unrolled for loop. I'm struggling to make that work properly, though.
So, I currently have:
q.submit([&](sycl::handler &h){
    h.parallel_for<class Foo>(sycl::nd_range<1>(n_blocks * n_threads, n_threads),
        [=](sycl::nd_item<1> it) {  // the nd_item is taken by value
            some_kernel(it, <other params here ...> );
        });
}).wait();
and my attempt is:
q.submit([&](sycl::handler &h){
    h.single_task<class Foo>([=]() {  // single_task takes a callable with no arguments
        #pragma unroll
        for (int i = 0; i < n_blocks * n_threads; ++i)
            some_kernel(...);
    });
}).wait();
But I'm not sure how to adapt what I was previously doing with the sycl::nd_item (for instance, how to use the loop index to replace the calls to the get_group and get_local_id methods).
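To make the question concrete, this is the index mapping I imagine the loop version would need, written as a plain C++ sketch on the host rather than as real SYCL. WorkItemIds, ids_from_index and run_flat_loop are hypothetical names of mine, and the body of some_kernel here is only a placeholder. It assumes a 1-D range and a kernel that uses no barriers or local memory, since a sequential loop would not reproduce those.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for sycl::nd_item<1>: only the indices the kernel reads.
struct WorkItemIds {
    std::size_t group;   // replaces it.get_group(0)
    std::size_t local;   // replaces it.get_local_id(0)
    std::size_t global;  // replaces it.get_global_id(0)
};

// Decompose a flat loop index, assuming n_threads work-items per group (1-D only).
inline WorkItemIds ids_from_index(std::size_t i, std::size_t n_threads) {
    return WorkItemIds{i / n_threads, i % n_threads, i};
}

// Placeholder kernel body: records the group and local id seen by each item.
inline void some_kernel(const WorkItemIds& id, std::size_t* groups, std::size_t* locals) {
    groups[id.global] = id.group;
    locals[id.global] = id.local;
}

// Host-side emulation of the loop that would sit inside single_task.
inline void run_flat_loop(std::size_t n_blocks, std::size_t n_threads,
                          std::size_t* groups, std::size_t* locals) {
    for (std::size_t i = 0; i < n_blocks * n_threads; ++i) {
        some_kernel(ids_from_index(i, n_threads), groups, locals);
    }
}
```

If this mapping is right, some_kernel would take the three indices instead of the nd_item, and the loop body would stay the same inside the single_task lambda. Whether the unroll pragma can do anything useful when n_blocks * n_threads is only known at run time is part of what I'm unsure about.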
Should I change the design of the kernel entirely? In other words, is the work-group / work-group-size approach not appropriate for FPGAs?