
Is it possible to access a hard disk or flash disk directly from the GPU (CUDA/OpenCL) and load/store its contents directly to/from the GPU's memory?

I am trying to avoid copying data from disk into host memory and then copying it over to the GPU's memory.

I read about Nvidia GPUDirect, but I'm not sure it does what I described above. It talks about remote GPU memory and disks, but in my case the disks are local to the GPU.

The basic idea is to load contents (something like DMA) -> do some operations -> store the contents back to disk (again in DMA fashion).

I am trying to involve the CPU and RAM as little as possible here.

Please feel free to offer any suggestions about the design.

L Lawliet
  • Can someone please explain what's wrong with the question? – L Lawliet Dec 03 '14 at 22:02
  • It's not possible without host intervention. The host owns the disk drive. GPUDirect is for transferring data between PCIe devices, fundamentally. If you had your own PCIe HDD controller, on the same PCIe fabric as the GPU, and access to the device driver source code, you could conceivably write a GPUDirect RDMA driver that would allow for direct transfer from GPU to disk. (It will still require host intervention to set up.) In practice, nobody assumes that this is the level of effort that you want to take on. – Robert Crovella Dec 03 '14 at 22:13
  • In practical terms, system memory throughput (25-50 GB/sec) and PCIe gen3 throughput (10-12 GB/sec) are so high compared to SSD throughput (0.5 GB/sec) that there should be only minimal impact on the throughput of GPU<->disk transfers when moving the data through the host. Latency may be a different story, but the question does not state specific latency or throughput requirements. – njuffa Dec 04 '14 at 05:44
  • @Siddharth You may want to take a look at [this GTC 2014 presentation](http://on-demand.gputechconf.com/gtc/2014/presentations/S4265-rdma-gpu-direct-for-fusion-io-iodrive.pdf) that discusses GPUDirect RDMA access to SSD-like storage. – njuffa Dec 04 '14 at 05:54
  • @njuffa 2016 is here, and so are NVMe SSDs with >2GB/s throughput. – bit2shift Jul 28 '16 at 10:33
  • @bit2shift I am aware of that but would claim it doesn't invalidate my earlier comment, when set in relation to the 12 GB/sec (one-way) bandwidth of a PCIe gen3 x16 link, or the 25-60 GB/sec bandwidth of system memory. Note that I also pointed the asker at the possibility of RDMA access for (enterprise class) SSDs, which should cover cases where the impact of going through CPU/system memory is not acceptable. – njuffa Jul 29 '16 at 19:18
  • @njuffa So, how does your comment stand given the announcement of the AMD Radeon Pro SSG? – bit2shift Jul 30 '16 at 01:31
  • @bit2shift As far as I know the Radeon SSG product is vaporware at this point, or, if you will, a proof of concept. To my knowledge, it couples two SSDs (in a RAID0 configuration) to the GPU with one PCIe gen3 x4 link each, meaning the SSD-to-GPU link has *half* the aggregate throughput of a GPU's normal PCIe x16 link to the system. It *may* do so at lower latency, but does not look like a game changer to me. – njuffa Jul 30 '16 at 02:02
  • @njuffa: Has that SSD-like storage gone anywhere in recent years? – einpoklum Nov 26 '17 at 22:32
  • @einpoklum Not sure what you are asking about. With NVMe, SSD storage can now offer multi-GB/s throughput at reasonable prices. [This recent thread](https://devtalk.nvidia.com/default/topic/1026863/cuda-programming-and-performance/p2p-dma-performance-limitation-where-a-single-cpu-is-invoked/) in the NVIDIA forums reports a throughput of 7.1 GB/sec from three SSDs in RAID-0 configuration to the GPU, although it is not clear that RDMA is used in that. PCIe gen3 is unchanged at 12 GB/sec throughput, system memory has improved to ~75 GB/sec for high-end systems (>= four channels of DDR4). – njuffa Nov 26 '17 at 22:50
  • @njuffa: I meant to ask if there is SSD/SSD-like storage, visible from PCIe directly, available these days as a real product other than a curiosity. And then there's the question of whether there's that plus GPUDirect support - but I just [asked this second question independently](https://stackoverflow.com/questions/47501827/which-kinds-of-devices-support-gpudirect-rdma-these-days) – einpoklum Nov 26 '17 at 22:53

3 Answers


For anyone else looking for this, 'lazy unpinning' did more or less what I wanted.

Go through the following (from NVIDIA's GPUDirect RDMA documentation) to see if it is helpful for you.

The most straightforward implementation using RDMA for GPUDirect would pin memory before each transfer and unpin it right after the transfer is complete. Unfortunately, this would perform poorly in general, as pinning and unpinning memory are expensive operations. The rest of the steps required to perform an RDMA transfer, however, can be performed quickly without entering the kernel (the DMA list can be cached and replayed using MMIO registers/command lists).

Hence, lazily unpinning memory is key to a high performance RDMA implementation. What this implies is keeping the memory pinned even after the transfer has finished. This takes advantage of the fact that the same memory region is likely to be used for future DMA transfers, so lazy unpinning saves pin/unpin operations.

An example implementation of lazy unpinning would keep a set of pinned memory regions and only unpin some of them (for example, the least recently used ones) if the total size of the regions reaches some threshold, or if pinning a new region fails because of BAR space exhaustion (see PCI BAR sizes).

Here is a link to an application guide and to the NVIDIA docs.
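
Below is a minimal sketch of that caching policy, using cudaHostRegister/cudaHostUnregister as stand-in pin/unpin primitives and a simple LRU list. The class name and size limit are made up for illustration, and the RDMA-specific parts (BAR mapping, DMA list caching/replay) are omitted; it only illustrates the lazy-unpinning idea described above.

#include <cuda_runtime.h>
#include <cstddef>
#include <list>

// A pinned (page-locked) host memory region tracked by the cache.
struct PinnedRegion { void* ptr; size_t size; };

// Keeps host regions pinned across transfers and only unpins the least
// recently used ones when a size budget is exceeded ("lazy unpinning").
class PinCache {
    std::list<PinnedRegion> lru_;   // front = most recently used
    size_t total_ = 0;              // total bytes currently pinned
    size_t limit_;                  // budget, e.g. tied to available BAR space
public:
    explicit PinCache(size_t limit) : limit_(limit) {}

    // Ensure `ptr` is pinned before a transfer; a cache hit costs nothing.
    cudaError_t acquire(void* ptr, size_t size) {
        for (auto it = lru_.begin(); it != lru_.end(); ++it) {
            if (it->ptr == ptr) {
                lru_.splice(lru_.begin(), lru_, it);  // mark most recently used
                return cudaSuccess;
            }
        }
        // Evict (unpin) least recently used regions until the new one fits.
        while (!lru_.empty() && total_ + size > limit_) {
            cudaHostUnregister(lru_.back().ptr);
            total_ -= lru_.back().size;
            lru_.pop_back();
        }
        cudaError_t err = cudaHostRegister(ptr, size, cudaHostRegisterDefault);
        if (err == cudaSuccess) {
            lru_.push_front(PinnedRegion{ptr, size});
            total_ += size;
        }
        return err;
    }
    // Note: nothing is unpinned when a transfer completes -- that is the point.
};

In a real GPUDirect RDMA driver the pin/unpin primitives would be the driver's own page-pinning and BAR-mapping calls rather than cudaHostRegister, but the caching policy is the same.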

L Lawliet

Trying to use this feature, I wrote a small example on Windows x64 to implement it. In this example, the kernel "directly" accesses disk space. Actually, as @RobertCrovella mentioned previously, the operating system does the job, probably with some CPU work, but no supplemental coding is required.

// Each thread writes its coordinates both near the start of the mapping and
// at an offset of 160M int4 elements (~2.5 GB into the 4 GB file mapping).
__global__ void kernel(int4* ptr)
{
    int4 val ;
    val.x = threadIdx.x ; val.y = blockDim.x ; val.z = blockIdx.x ; val.w = gridDim.x ;
    ptr[threadIdx.x + blockDim.x * blockIdx.x] = val ;
    ptr[160 * 1024 * 1024 + threadIdx.x + blockDim.x * blockIdx.x] = val ;
}

#include "Windows.h"

int main()
{
    // 4GB - larger than installed GPU memory
    size_t size = 256 * 1024 * 1024 * sizeof(int4) ;

    // Create (or overwrite) the 4 GB backing file on disk
    HANDLE hFile = ::CreateFileA ("GPU.dump", (GENERIC_READ | GENERIC_WRITE), 0, 0, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL) ;

    // Create a file mapping object covering the full size (split into high/low DWORDs)
    HANDLE hFileMapping = ::CreateFileMapping (hFile, 0, PAGE_READWRITE, (DWORD)(size >> 32), (DWORD)size, 0) ;

    // Map the whole file into the process address space
    void* ptr = ::MapViewOfFile (hFileMapping, FILE_MAP_ALL_ACCESS, 0, 0, size) ;

    // Allow host memory to be mapped into the device address space
    ::cudaSetDeviceFlags (cudaDeviceMapHost) ;

    // Page-lock the file-backed view and map it for GPU access
    cudaError_t er = ::cudaHostRegister (ptr, size, cudaHostRegisterMapped) ;
    if (cudaSuccess != er)
    {
        printf ("could not register\n") ;
        return 1 ;
    }

    // Retrieve the device-side pointer corresponding to the mapped view
    void* d_ptr ;
    er = ::cudaHostGetDevicePointer (&d_ptr, ptr, 0) ;
    if (cudaSuccess != er)
    {
        printf ("could not get device pointer\n") ;
        return 1 ;
    }

    // Kernel writes go through the mapping; the OS pages data to/from the file
    kernel<<<256,256>>> ((int4*)d_ptr) ;

    if (cudaSuccess != ::cudaDeviceSynchronize())
    {
        printf ("error in kernel\n") ;
        return 1 ;
    }

    // Unpin the view, then unmap it and close the Windows handles
    if (cudaSuccess != ::cudaHostUnregister (ptr))
    {
        printf ("could not unregister\n") ;
        return 1 ;
    }

    ::UnmapViewOfFile (ptr) ;

    ::CloseHandle (hFileMapping) ;
    ::CloseHandle (hFile) ; 

    ::cudaDeviceReset() ;

    printf ("DONE\n");

    return 0 ;
}
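
Note that the writes go through the Windows page cache: the data only reaches the disk once the OS writes the dirty pages back (FlushViewOfFile could be called on the mapped range before unmapping to force this).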
Florent DUGUET
    Won't this just cache everything in main memory and occasionally/finally flush changes to disk? – einpoklum Nov 26 '17 at 22:31
  • @einpoklum, it sounded to me like a reasonable answer to the initial question: "Basic idea is to load contents (something like dma) -> do some operations -> store contents back to disk (again in dma fashion)." That might indeed be the case, though. However, programmatically, nothing needs to be implemented on the CPU side. – Florent DUGUET Nov 27 '17 at 10:09

The real solution is on the horizon!

Early access: https://developer.nvidia.com/gpudirect-storage

GPUDirect® Storage (GDS) is the newest addition to the GPUDirect family. GDS enables a direct data path for direct memory access (DMA) transfers between GPU memory and storage, which avoids a bounce buffer through the CPU. This direct path increases system bandwidth and decreases the latency and utilization load on the CPU.
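
For reference, here is a rough sketch of what a read/compute/write cycle looks like through the cuFile API that GDS exposes. GDS is still in early access at the time of writing, so exact names, flags, and error handling may differ from the shipping release (consult the GDS documentation); the file name and buffer size are placeholders, and error checking is omitted for brevity.

#define _GNU_SOURCE               // for O_DIRECT
#include <fcntl.h>
#include <unistd.h>
#include <cuda_runtime.h>
#include "cufile.h"

int main()
{
    const size_t size = 1 << 20;                    // 1 MiB, placeholder size

    // GDS requires O_DIRECT on a supported filesystem (Linux only)
    int fd = open("data.bin", O_RDWR | O_DIRECT);

    cuFileDriverOpen();                             // initialize the GDS driver

    CUfileDescr_t descr = {};
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
    CUfileHandle_t fh;
    cuFileHandleRegister(&fh, &descr);              // wrap the fd for cuFile I/O

    void* d_buf;
    cudaMalloc(&d_buf, size);
    cuFileBufRegister(d_buf, size, 0);              // register the GPU buffer for DMA

    // Disk -> GPU memory, without a bounce buffer in host RAM
    cuFileRead(fh, d_buf, size, /*file_offset=*/0, /*devPtr_offset=*/0);

    // ... launch kernels that operate on d_buf here ...

    // GPU memory -> disk
    cuFileWrite(fh, d_buf, size, /*file_offset=*/0, /*devPtr_offset=*/0);

    cuFileBufDeregister(d_buf);
    cudaFree(d_buf);
    cuFileHandleDeregister(fh);
    close(fd);
    cuFileDriverClose();
    return 0;
}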

Unknown