6

I'm working on a GPU-accelerated program that requires reading an entire file of variable size. My question is: what is the optimal number of bytes to read from a file and transfer to a coprocessor (CUDA device)?

These files could be as large as 2GiB, so creating a buffer of that size doesn't seem like the best idea.

sj755

2 Answers

4

You can cudaMalloc a buffer of the maximum size you can allocate on your device. After that, copy chunks of your input data of that size from host to device, process them, copy back the results, and continue until the whole input has been consumed.

// Your input data on host
int  hostBufNum = 5600000;
int* hostBuf    = ...;

// Assume this is the largest device buffer you can allocate
int  devBufNum = 1000000;
int* devBuf;

cudaMalloc( &devBuf, sizeof( int ) * devBufNum );

int* hostChunk = hostBuf;
int  hostLeft  = hostBufNum;

while ( hostLeft > 0 )
{
    // Size of this chunk: a full device buffer, or whatever is left at the end
    int chunkNum = ( hostLeft < devBufNum ) ? hostLeft : devBufNum;

    cudaMemcpy( devBuf, hostChunk, chunkNum * sizeof( int ), cudaMemcpyHostToDevice );
    doSomethingKernel<<< gridSize, blockSize >>>( devBuf, chunkNum );   // launch config is up to you
    // ... copy this chunk's results back to the host here ...

    hostChunk += chunkNum;
    hostLeft  -= chunkNum;
}
Ashwin Nanjappa
  • That part I was already planning, but what size should the chunks of input data be? – sj755 Mar 16 '12 at 03:15
  • The size of the largest array you can allocate on device. – Ashwin Nanjappa Mar 16 '12 at 03:29
  • 5
    You could consider using async memcopies of somewhat smaller chunks than will fit in memory (at most half) and processing chunk `k` in parallel with transferring chunk `k-1` back to the host and transferring chunk `k+1` from the host to device. Bidirectional overlap requires a Tesla GPU, but you can overlap one direction even on GeForce. – harrism Mar 16 '12 at 04:07
  • 1
    Also, you can use `cuGetMemInfo` to get the amount of available memory. http://forums.nvidia.com/index.php?showtopic=102339 – Jason R. Mick Mar 16 '12 at 06:00
  • 1
    @JasonR.Mick: since about CUDA 3.1 there has been cudaGetMemInfo in the runtime API, which does the same thing but saves having to mix runtime and driver APIs in host code. – talonmies Mar 16 '12 at 09:32
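
For reference, here is a minimal sketch of the memory query the last two comments suggest, using the runtime API's cudaMemGetInfo. Halving the reported free memory into a chunk size is only an illustrative policy, not something from the answers above.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t freeBytes = 0, totalBytes = 0;

    // Ask the runtime how much device memory is currently free and how much exists in total
    cudaError_t err = cudaMemGetInfo( &freeBytes, &totalBytes );
    if ( err != cudaSuccess )
    {
        std::fprintf( stderr, "cudaMemGetInfo failed: %s\n", cudaGetErrorString( err ) );
        return 1;
    }

    // Leave some headroom rather than trying to grab every last free byte (illustrative choice)
    size_t chunkBytes = freeBytes / 2;
    std::printf( "free: %zu  total: %zu  chunk: %zu\n", freeBytes, totalBytes, chunkBytes );
    return 0;
}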
0

If you can split your function up so that it works on chunks of the data on the card, you should look into using streams (cudaStream_t).

If you schedule loads and kernel executions in several streams, you can have one stream load data while another executes a kernel on the card, thereby hiding some of the transfer time of your data in the execution of a kernel.

You need to declare a buffer of whatever your chunk size is, multiplied by however many streams you use (up to 16 for compute capability 1.x, as far as I know).
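
As a rough illustration of that idea (and of harrism's comment on the other answer), here is a minimal sketch that round-robins chunks over two streams so a copy in one stream can overlap a kernel in the other. The kernel name, launch configuration, chunk size, and element count are placeholders, not anything from the answer itself.

const int numStreams = 2;
const int chunkNum   = 1000000;        // elements per chunk (placeholder)
const int hostBufNum = 5600000;        // total elements on the host (placeholder)

// Async copies require pinned (page-locked) host memory
int* hostBuf;
cudaMallocHost( &hostBuf, hostBufNum * sizeof( int ) );
// ... read the file into hostBuf ...

int*         devBuf[numStreams];
cudaStream_t streams[numStreams];
for ( int s = 0; s < numStreams; ++s )
{
    cudaMalloc( &devBuf[s], chunkNum * sizeof( int ) );
    cudaStreamCreate( &streams[s] );
}

// Round-robin chunks over the streams: while one stream's kernel runs,
// the next stream's host-to-device copy can already be in flight.
for ( int offset = 0, s = 0; offset < hostBufNum; offset += chunkNum, s = ( s + 1 ) % numStreams )
{
    int thisChunk = ( hostBufNum - offset < chunkNum ) ? hostBufNum - offset : chunkNum;

    cudaMemcpyAsync( devBuf[s], hostBuf + offset, thisChunk * sizeof( int ),
                     cudaMemcpyHostToDevice, streams[s] );
    doSomethingKernel<<< gridSize, blockSize, 0, streams[s] >>>( devBuf[s], thisChunk );
}

cudaDeviceSynchronize();    // wait for all streams to finish

If you also copy results back per chunk with cudaMemcpyAsync in the same stream, overlapping both directions at once needs a GPU with two copy engines, as harrism notes above.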

P O'Conbhui