
I've read this description of the OpenCL 2.x pipe API and leafed through the Pipe API pages at khronos.org. Working almost exclusively in CUDA, I felt kind of jealous of this nifty feature available only in OpenCL (and sorry that CUDA functionality has not been properly subsumed by OpenCL, but that's a different issue), so I thought I'd ask "How come CUDA doesn't have a pipe mechanism?". But then I realized I don't even know what that would mean exactly. So, instead, I'll ask:

  1. How do OpenCL pipes work on AMD discrete GPUs / APUs? ...

    • What info gets written where?
    • How is the scheduling of kernel workgroups to cores affected by the use of pipes?
    • Do piped kernels get compiled together (say, their SPIR forms)?
    • Does the use of pipes allow passing data between different kernels via the core-specific cache ("local memory" in OpenCL parlance, "shared memory" in CUDA parlance)? That would be awesome.
  2. Is there a way pipes are "supposed" to work on a GPU, generally? i.e. something the API authors envisioned or even put in writing?
  3. How do OpenCL pipes work in CPU-based OpenCL implementations?
einpoklum

1 Answer


OpenCL pipes were introduced along with OpenCL 2.0. On GPUs, a pipe is essentially a global memory buffer with controlled access, i.e. you can limit the number of workgroups that are allowed to write to or read from a pipe simultaneously. This lets us re-use the same buffer (the pipe) without worrying about conflicting reads or writes from multiple workgroups.

As far as I know, OpenCL pipes do not use GPU local memory. However, if you carefully adjust the size of the pipe, you can increase the cache hit rate and thus achieve better overall performance. There is no general rule as to when pipes should be used. I use pipes to pass data between two concurrently running kernels, which improves my program's overall performance thanks to a better cache hit ratio.

Pipes work the same way on CPUs: the pipe is just a global buffer, which may fit in the system cache if it is small enough. On devices like FPGAs, however, they work differently. There, pipes make use of local memory instead of global memory and therefore achieve considerably higher performance than a global memory buffer.
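For reference, kernel-side use of the pipe built-ins looks roughly like this (a minimal sketch, not taken from any particular implementation; the kernel names and the int payload are made up):

    // Host side: a pipe is created much like a buffer object, e.g.
    //   cl_mem p = clCreatePipe(ctx, 0, sizeof(int), 1024, NULL, &err);
    // and then passed as a kernel argument to both kernels.

    __kernel void producer(__write_only pipe int p) {
        int val = (int)get_global_id(0);
        // write_pipe returns 0 on success, a negative value if the pipe is full.
        if (write_pipe(p, &val) != 0) {
            // pipe full: this sketch simply drops the packet
        }
    }

    __kernel void consumer(__read_only pipe int p, __global int *out) {
        int val;
        if (read_pipe(p, &val) == 0)    // 0 means one packet was consumed
            out[get_global_id(0)] = val;
    }

The "controlled access" mentioned above is exposed through the reservation built-ins (reserve_write_pipe, work_group_reserve_write_pipe and their read/commit counterparts), which let a work-item or a whole work-group claim a contiguous range of packets before writing or reading them.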

Johns Paul
  • I would add that emulating pipes on CUDA is fairly trivial with a global buffer, a reservation buffer, and atomics (a sketch of this idea appears after these comments). Also, I believe pipes on AMD use local memory for the reservation buffers. – user703016 Aug 05 '16 at 05:05
  • @JohnsPaul: "Better cache hit ratio" - do you mean L2 cache or core-specific L1 cache? And - are AMD GPUs or their drivers able to "prefer" pipe consumers from the same core that last produced pipe data over consumers on other cores, so as to utilize L1 cache? – einpoklum Aug 05 '16 at 06:53
  • @AndreasPapadopoulos: 1. It might not be so trivial to avoid sacrificing performance to the synchronization overhead of those atomics. 2. How would using local memory for the reservation buffer work, seeing that workgroups on different cores need to make reservations? – einpoklum Aug 05 '16 at 06:56
  • @einpoklum: I was talking about the L2 cache which is shared among all the GPU cores. I am not sure about the L1 cache. – Johns Paul Aug 05 '16 at 07:05
  • @einpoklum There's no "sacrificing performance"; OpenCL pipes on GPU most likely use atomics too. In fact the pipe API on GPU is mostly syntactic sugar. Reservation buffers are per work-item/work-group. – user703016 Aug 05 '16 at 07:06
  • @AndreasPapadopoulos: I understand, but minimizing the cost of using atomics for a pipe implementation might be tricky rather than trivial (e.g. deciding on the layout of the global buffer, how many atomic variables to use and exactly how, etc.). – einpoklum Aug 05 '16 at 07:31
  • I'm not sure what you mean by that as the pipe interface pretty much dictates the implementation. Anyway, I implemented something similar in CUDA years ago, so maybe that's why I think it's trivial. – user703016 Aug 05 '16 at 07:32
  • @AndreasPapadopoulos: I have implemented the same on NVIDIA GPUs using atomics. The implementation is not so difficult, but at the same time the performance improvement will not be as significant as with OpenCL pipes (at least in my experiments). There could be a lot of other factors affecting performance here (like OpenCL versions, etc.); I am just mentioning what I observed. – Johns Paul Aug 05 '16 at 07:37
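As suggested in the first comment, a pipe-like FIFO can be emulated with a global buffer plus atomic counters. Below is a minimal sketch of that idea, written in OpenCL C for consistency with the rest of this page (the same few lines translate directly to CUDA with atomicAdd). All names here are made up, and to keep the sketch correct it assumes the consumer kernel is enqueued after the producer has finished; truly concurrent kernels would additionally need per-slot "ready" flags and wrap-around handling:

    __kernel void emulated_producer(__global int *buf,
                                    __global volatile int *tail, // slots reserved so far
                                    int capacity) {
        int val = (int)get_global_id(0);   // arbitrary payload
        int slot = atomic_inc(tail);       // reserve one slot in the buffer
        if (slot < capacity)
            buf[slot] = val;               // commit the packet
        // else: the "pipe" is full and the write is dropped in this sketch
    }

    __kernel void emulated_consumer(__global const int *buf,
                                    __global volatile int *head, // slots consumed so far
                                    __global const int *tail,
                                    int capacity,
                                    __global int *out) {
        int avail = min(*tail, capacity);  // packets actually committed
        int slot = atomic_inc(head);       // claim one packet
        if (slot < avail)
            out[get_global_id(0)] = buf[slot];
        // else: the "pipe" is empty for this work-item
    }

The atomic increment on the tail counter plays the role of a pipe write reservation, and the one on the head counter plays the role of a read reservation, which is essentially what the comments above mean by a "reservation buffer".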