
In CUDA, we can achieve kernel-managed data transfer from host memory to device shared memory through a device-side pointer to host memory, like this:

int *a, *b, *c;                  // host pointers
int *dev_a, *dev_b, *dev_c;      // device pointers to host memory

    …       

cudaHostGetDevicePointer(&dev_a, a, 0); // no memory copy to the device is needed; device pointers to host memory are used instead
cudaHostGetDevicePointer(&dev_b, b, 0);
cudaHostGetDevicePointer(&dev_c, c, 0);

    …   

// kernel launch
add<<<B,T>>>(dev_a, dev_b, dev_c);
// dev_a, dev_b and dev_c are passed into the kernel so that it can access host memory directly.

In the above example, the kernel can access host memory via dev_a, dev_b and dev_c, and it can use these pointers to move data from host memory into shared memory directly, without staging it through global memory.
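For completeness, here is a minimal self-contained sketch of this mapped-memory setup. The kernel body, the sizes N and T, and the cudaSetDeviceFlags/cudaHostAlloc calls are my assumptions for illustration; they are not the code elided above.

#include <cstdio>
#include <cuda_runtime.h>

#define N 1024   // assumed problem size (a multiple of T)
#define T 256    // assumed threads per block

// Hypothetical kernel matching the add<<<B,T>>> launch: dev_a, dev_b and
// dev_c are device-side pointers to mapped (pinned) host memory.
__global__ void add(const int *dev_a, const int *dev_b, int *dev_c)
{
    __shared__ int sa[T];
    __shared__ int sb[T];

    int i = blockIdx.x * blockDim.x + threadIdx.x;

    // These loads fetch from host memory (over PCIe, via DMA) and the
    // results are stored straight into shared memory; the host never
    // issued an explicit cudaMemcpy into global memory.
    sa[threadIdx.x] = dev_a[i];
    sb[threadIdx.x] = dev_b[i];
    __syncthreads();

    dev_c[i] = sa[threadIdx.x] + sb[threadIdx.x];
}

int main()
{
    int *a, *b, *c;                  // host pointers (pinned, mapped)
    int *dev_a, *dev_b, *dev_c;      // device pointers to the same memory

    cudaSetDeviceFlags(cudaDeviceMapHost);                      // enable mapping
    cudaHostAlloc((void **)&a, N * sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc((void **)&b, N * sizeof(int), cudaHostAllocMapped);
    cudaHostAlloc((void **)&c, N * sizeof(int), cudaHostAllocMapped);

    for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2 * i; }

    cudaHostGetDevicePointer((void **)&dev_a, a, 0);
    cudaHostGetDevicePointer((void **)&dev_b, b, 0);
    cudaHostGetDevicePointer((void **)&dev_c, c, 0);

    add<<<N / T, T>>>(dev_a, dev_b, dev_c);
    cudaDeviceSynchronize();         // results are visible in host array c[]

    printf("c[10] = %d\n", c[10]);   // expect 30

    cudaFreeHost(a); cudaFreeHost(b); cudaFreeHost(c);
    return 0;
}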

But it seems that this is impossible in OpenCL? (Local memory in OpenCL is the counterpart of shared memory in CUDA.)

jxj
  • It is impossible to have the same functionality. However, it is not a core feature. I can't think of a case where you are really forced to do this kind of operation. Isn't it possible to just call clEnqueueWriteBuffer() before calling the kernel? Is the device local cache a limiting factor for you? – DarkZeros Nov 05 '13 at 17:14
  • It is a good way to overlap transfer and computation. The explicit copy operation issued from the host side, which can only move data from host memory to device global memory, is avoided. With a device-side pointer, data can be transferred between host memory and shared memory directly. It lets the device schedule computation and data transfer as it needs, which means the transfer may be hidden. The traditional way is multiple streams (CUDA) or multiple command queues (OpenCL), which needs explicit scheduling on the host side and makes the overall code a little bit ugly/hairy. – jxj Nov 06 '13 at 03:38

1 Answer


You can find an exactly identical API in OpenCL.

How it works in CUDA:

According to this presentation and the official documentation.

The money quote about cudaHostGetDevicePointer:

Passes back device pointer of mapped host memory allocated by cudaHostAlloc or registered by cudaHostRegister.

CUDA's cudaHostAlloc with cudaHostGetDevicePointer works exactly like CL_MEM_ALLOC_HOST_PTR with MapBuffer works in OpenCL. Basically, if it's a discrete GPU the results are cached in the device, and if it's an integrated GPU that shares memory with the host it will use the memory directly. So there is no actual 'zero copy' operation with a discrete GPU in CUDA.
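A minimal host-side sketch of that OpenCL counterpart (the helper name and sizes are just for illustration; context/queue setup and error handling are omitted):

/* Create a buffer whose backing store is host memory allocated by the
 * runtime (CL_MEM_ALLOC_HOST_PTR) and fill it through map/unmap instead
 * of an explicit clEnqueueWriteBuffer. */
#include <CL/cl.h>
#include <string.h>

cl_mem make_mapped_input(cl_context context, cl_command_queue queue,
                         const int *host_data, size_t n)
{
    cl_int err;

    cl_mem buf = clCreateBuffer(context,
                                CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY,
                                n * sizeof(int), NULL, &err);

    /* Map the buffer into the host address space and fill it; on unmap the
     * implementation makes the data visible to the device (caching it in
     * device memory on a discrete GPU, using it in place on an integrated
     * one). */
    int *mapped = (int *)clEnqueueMapBuffer(queue, buf, CL_TRUE, CL_MAP_WRITE,
                                            0, n * sizeof(int),
                                            0, NULL, NULL, &err);
    memcpy(mapped, host_data, n * sizeof(int));
    clEnqueueUnmapMemObject(queue, buf, mapped, 0, NULL, NULL);

    return buf;   /* pass to clSetKernelArg like any other __global buffer */
}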

The function cudaHostGetDevicePointer does not accept raw malloc'd pointers, just like the limitation in OpenCL. From the API user's point of view the two are exactly identical approaches, allowing the implementation to do pretty much identical optimizations.

With a discrete GPU the pointer you get points to an area from which the GPU can transfer data directly via DMA. Otherwise the driver would take your pointer, copy the data to a DMA-capable area and then initiate the transfer.

However, in OpenCL 2.0 that is explicitly possible, depending on the capabilities of your devices. With the finest-granularity sharing you can use arbitrary malloc'd host pointers and even use atomics shared with the host, so you could even dynamically control the kernel from the host while it is running.

http://www.khronos.org/registry/cl/specs/opencl-2.0.pdf

See page 162 for the shared virtual memory spec. Do note that when you write kernels, even these are still just __global pointers from the kernel's point of view.
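A minimal sketch of that fine-grained SVM path, assuming a device that reports CL_DEVICE_SVM_FINE_GRAIN_BUFFER (the helper name, kernel argument layout and sizes are just for illustration; error handling is omitted):

/* Allocate shared virtual memory that both the host and the kernel can
 * dereference directly; no clEnqueueWriteBuffer and no map/unmap needed. */
#include <CL/cl.h>

void launch_with_svm(cl_context context, cl_command_queue queue,
                     cl_kernel kernel, size_t n)
{
    int *data = (int *)clSVMAlloc(context,
                                  CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER,
                                  n * sizeof(int), 0);

    for (size_t i = 0; i < n; ++i)   /* fill directly from the host */
        data[i] = (int)i;

    clSetKernelArgSVMPointer(kernel, 0, data);   /* seen as a __global int* in the kernel */

    size_t global = n;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL, 0, NULL, NULL);
    clFinish(queue);

    /* With fine-grained SVM the host can read the results from data[] directly. */
    clSVMFree(context, data);
}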

sharpneli
  • Hi, thanks for your reply. But I don't think OpenCL 2.0 works for my original purpose, and neither does CL_MEM_USE_HOST_PTR nor CL_MEM_ALLOC_HOST_PTR in OpenCL 1.2. I actually inspected Nvidia's SDK examples and best practices guide, and I realized that all the OpenCL tricks play out before kernel launch, so this doesn't help overlap transfer and computation (CUDA's device-side pointer to host memory does, because the transfer begins after kernel launch, on the kernel's demand). – jxj Nov 06 '13 at 07:15
  • It depends on the implementation at what stage the actual transfer happens. Even if you issue the commands before the kernel launch, the implementation is completely free to delay the reads until the kernel launch itself. However, in most cases that is not the optimal way; simply caching the memory object is fastest in most cases. Do you have any particular reason for wanting to transfer directly from the host in int-sized chunks? – sharpneli Nov 06 '13 at 08:15
  • I edited the main post. According to the information I found, the CUDA device-side pointer is just pinned memory and will be cached before the CUDA kernel launch, on a discrete GPU that is. That is not surprising, because the memory latency over the PCIe bus would completely destroy performance if the data were not transferred before launch. – sharpneli Nov 06 '13 at 08:28
  • Thank you sharpneli. It seems that caching is (implicitly) a must for a discrete GPU. The first question is where to cache (in on-GPU or off-GPU memory?). A more important question is whether the caching has to be completely done before getting the device-side pointer (before kernel launch), or whether it can be done in a pipelined manner, on the kernel's demand, even after the kernel has launched. The former is like OpenCL's mechanism; the latter may have the advantage of overlapping transfer and computation. – jxj Nov 07 '13 at 09:47
  • The GPU's global memory is the fastest memory right after local memory, so that's where the implementation will almost certainly move all the data. Its latency is roughly 1/100 of the PCIe bus latency. The caching will be done completely; as you can see, you cannot write to the buffer from the host anymore after obtaining the device pointer. I recommend simply transferring the buffers for the next stage of computation while the GPU is calculating the previous stage. Trying to overlap the transfers of a kernel that is already running is quite likely a wasted effort on current hardware. – sharpneli Nov 07 '13 at 10:14
  • In the CUDA_C_Programming_Guide, 3.2.4.3. Mapped Memory. "Accessing host memory directly from within a kernel has several advantages: ... There is no need to use streams (see Concurrent Data Transfers) to overlap data transfers with kernel execution; the kernel-originated data transfers automatically overlap with kernel execution." – jxj Nov 13 '13 at 07:07
  • And if you search ahead in the programming guide, you will find that zero-copy requires the presence of "canMapHostMemory", which again is the same as the fine-grained SVM sharing in OpenCL 2.0. – sharpneli Nov 13 '13 at 09:21
  • Another money quote is "Such a block has therefore in general two addresses: one in host memory that is returned by cudaHostAlloc() or malloc(), and one in device memory that can be retrieved using cudaHostGetDevicePointer() and then used to access the block from within a kernel. The only exception is for pointers allocated with cudaHostAlloc() and when a unified address space is used for the host and the device as mentioned in Unified Virtual Address Space." – sharpneli Nov 13 '13 at 09:55
  • "Trying to overlap transfers of a kernel that is already running is quite likely a wasted effort in current hw.". According to CUDA_C_Programming_Guide, seems that by device side pointer, overlapping can be achieved automatically? We test the canMapHostMemory property on an old GTX480, it supports this proerty. – jxj Nov 13 '13 at 10:02
  • Yes. And as you can read in the CUDA_C_Programming_Guide under 3.2.4.3 Mapped Memory, which I quoted in my previous comment, it's possible that the block is copied into device memory: "Such a block has therefore in general two addresses: ...". By using that you don't have to use the CUDA stream API (and in OpenCL you don't have to use enqueueWriteBuffer), but you still need to handle the synchronization: "Since mapped page-locked memory is shared between host and device however, the application must synchronize memory accesses using streams or events". 1/2 – sharpneli Nov 13 '13 at 10:15
  • What I meant by "trying to overlap transfers" was not block-level transfers to the device, but delaying the read across the PCIe bus until the kernel actually issues the memory read. I worded it badly. So: overlapping the transfer of the rest of the buffer while kernels churn through the start of the buffer/image/whatnot = good. Delaying reads until the read is actually issued across the PCIe bus = bad. In the first case the data is implicitly moved to global memory, in the second it is not. 2/2 – sharpneli Nov 13 '13 at 10:18
  • http://developer.amd.com/download/AMD_Accelerated_Parallel_Processing_OpenCL_Programming_Guide.pdf See page 4-17 (77) for which OpenCL memory objects produce behaviour on AMD hardware equivalent to what CUDA produces on NVidia hardware. As you can see, the actual behaviour depends on the device type. The same applies to CUDA on NVidia hardware (which is why they say 'may' and 'it's possible' so often in the CUDA programming guide). – sharpneli Nov 13 '13 at 10:31
  • In CUDA_C_Programming_Guide 3.2.4.3: "Such a block has therefore in general two addresses: ...". But soon after that: "The only exception is for pointers ... Unified Virtual Address Space". In 3.2.7, Unified Virtual Address Space, the 64-bit and compute capability 2.0 requirements are easily met nowadays. When using a device-side pointer there is no way for us to know whether the real implementation is the "good" or the "bad" one you described. In Nvidia's profiling tool only the kernel execution period is shown; no separate transfer period can be seen, because we don't issue an explicit write buffer. – jxj Nov 14 '13 at 06:08