Having defined how to deal with errors:
static void HandleError( cudaError_t err,
                         const char *file,
                         int line ) {
    if (err != cudaSuccess) {
        printf( "%s in %s at line %d\n", cudaGetErrorString( err ),
                file, line );
        exit( EXIT_FAILURE );
    }
}
#define HANDLE_ERROR( err ) (HandleError( err, __FILE__, __LINE__ ))
Normally, when the results array d_results (of type double and size N) can be allocated in GPU memory all at once, we compute on the device and transfer the data back to the host like so:
double *d_results;
HANDLE_ERROR( cudaMalloc( &d_results, N * sizeof(double) ) );
// Launch our kernel to do some computations and store the results in d_results
.....
// and transfer our data from the device to the host
std::vector<double> results(N);
HANDLE_ERROR( cudaMemcpy( results.data(), d_results, N * sizeof(double),
                          cudaMemcpyDeviceToHost ) );
What if the cudaMalloc call fails because there is not enough GPU memory to store all the results at once? How can I do the computations and transfer the results to the host properly? Is it mandatory to do the computation in batches? I would rather avoid manual batching. What is the standard approach to this situation in CUDA?