
I've written the following code in DPC++ to test time consumption.

// sub-device creation omitted
cl::sycl::queue q[4] = {cl::sycl::queue{SubDevices1[0]}, cl::sycl::queue{SubDevices1[1]},
                        cl::sycl::queue{SubDevices2[0]}, cl::sycl::queue{SubDevices2[1]}};

void run() {
    for (int i = 0; i < 4; i++) {
        q[i].submit([&](auto &h) {
            h.parallel_for(
                sycl::nd_range<2>(sycl::range<2>(1, 1), sycl::range<2>(1, 1)),
                [=](sycl::nd_item<2> it) {
                    // intentionally empty
                });
        });
    }
}

It takes about 0.6 s.

When testing one queue with one parallel_for, it takes about 0.15 s.

An even weirder thing happened when testing

q[i].submit([&](auto &h) {h.memcpy(...);});

When the copied array is small, this command takes almost no time.

How can I optimize the code in run()? Thanks very much!

    What are you trying to achieve? Running an empty kernel is not going to tell you a lot about anything. The array you are talking about will affect the performance of the kernel in general because the cost of transferring the memory from the CPU to the GPU will affect the overall performance. A small memory transfer will make less of a performance hit than a larger memory transfer. Take a look at some of the sample code, in particular I can recommend looking at SYCL Academy https://github.com/codeplaysoftware/syclacademy – Rod Burns Aug 09 '22 at 09:35

1 Answer


If you run on different devices, then all the queues will execute in parallel.

If you want to run multiple queues on a single device, you need to create a separate context for each queue; then they will execute in parallel.

context c1{};
queue q1{c1, gpu_selector()};
context c2{};
queue q2{c2, gpu_selector()};
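Applied to the four sub-device queues in the question, that idea might look like the following (a sketch, not tested; it assumes SubDevices1 and SubDevices2 are the sub-device arrays from the question):

```cpp
#include <sycl/sycl.hpp>

// Sketch: give each queue its own context so submissions do not
// serialize on a shared context. Assumes SubDevices1/SubDevices2
// exist as in the question.
sycl::queue make_independent_queue(const sycl::device &d) {
    sycl::context c{d};  // dedicated context for this queue
    return sycl::queue{c, d};
}

// Usage, replacing the original array of queues:
// sycl::queue q[4] = {make_independent_queue(SubDevices1[0]),
//                     make_independent_queue(SubDevices1[1]),
//                     make_independent_queue(SubDevices2[0]),
//                     make_independent_queue(SubDevices2[1])};
```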