I wrote the following DPC++ code to measure how long kernel submission takes.
// (code that defines the sub-devices is omitted)
cl::sycl::queue q[4] = {cl::sycl::queue{SubDevices1[0]}, cl::sycl::queue{SubDevices1[1]},
                        cl::sycl::queue{SubDevices2[0]}, cl::sycl::queue{SubDevices2[1]}};

void run() {
    for (int i = 0; i < 4; i++) {
        q[i].submit([&](auto &h) {
            h.parallel_for(
                sycl::nd_range<2>(sycl::range<2>(1, 1), sycl::range<2>(1, 1)),
                [=](sycl::nd_item<2> it) {
                    // just empty
                }
            );
        });
    }
}
Calling run() takes about 0.6 s.
With a single queue and a single parallel_for, it takes about 0.15 s.
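A minimal sketch of the kind of wall-clock measurement meant here, assuming std::chrono::steady_clock is used. The helper name time_seconds is hypothetical and not part of the original code. It runs the work once as a warm-up before the timed run, so one-time setup costs can be told apart from the steady-state cost.

```cpp
#include <chrono>
#include <functional>

// Hypothetical helper (not from the original code): runs `work` once as a
// warm-up, then returns the wall-clock seconds taken by a second, timed run.
double time_seconds(const std::function<void()> &work) {
    work();  // warm-up run: absorbs any one-time setup cost
    const auto start = std::chrono::steady_clock::now();
    work();  // timed run
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}
```

For the code above, the callable would be something like a lambda that calls run() and then q[i].wait() on each of the four queues, so that the timed region covers kernel completion and not only submission.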
Something even stranger happens when I test this instead:
q[i].submit([&](auto &h) {h.memcpy(...);});
When the copied array is small, this command takes almost no time.
How can I optimize the code in run() above? Thanks a lot!