I have a number of jobs to execute. Each job consists of a buffer write, a kernel execution, and a buffer read, and these operations must of course be executed in that order. The jobs themselves, however, are independent and can therefore be executed concurrently.
Is there any performance difference between using multiple in-order command queues (as one would do with CUDA streams) and a single out-of-order command queue with equivalent synchronization? Which is better?