Post-Pascal UM can allocate more memory than the GPU memory, which swap-in swap-out between GPU memory and host memory automatically.
So what if run out of GPU memory and host memory? How can I use the swap space on disk? Virtual memory swap space seems not work in cudaMallocManaged
case. Here is how I did the experiment:
- create swap space:
dd if=/dev/zero of=./swapfile bs=1G count=16
,mkswap
andswapon
- create host memory occupier, burn out 99% of host memory
for (i = 0; i < 8000; i++)
malloc(1<<20);
- create GPU memory occupier with
cudaMalloc
, leaving 1G of GPU memory - offical uvm demo but with a 12GB workload, which will run out of all available GPU memory and host memory
#include <iostream>
#include <math.h>
// CUDA kernel to add elements of two arrays
__global__ void add(int n, float *x, float *y) {
int index = blockIdx.x * blockDim.x + threadIdx.x;
int stride = blockDim.x * gridDim.x;
for (int i = index; i < n; i += stride)
y[i] = x[i] + y[i];
}
int main(void) {
long long N = 6LL * (1 << 30) / sizeof(float); // <<<<<<<<<<<
float *x, *y;
// Allocate Unified Memory -- accessible from CPU or GPU
cudaMallocManaged(&x, N * sizeof(float));
cudaMallocManaged(&y, N * sizeof(float));
// initialize x and y arrays on the host
for (int i = 0; i < N; i++) {
x[i] = 1.0f;
y[i] = 2.0f;
}
// Launch kernel on 1M elements on the GPU
int blockSize = 256;
int numBlocks = (N + blockSize - 1) / blockSize;
add<<<numBlocks, blockSize>>>(N, x, y);
// Wait for GPU to finish before accessing on host
cudaDeviceSynchronize();
// Check for errors (all values should be 3.0f)
float maxError = 0.0f;
for (int i = 0; i < N; i++)
maxError = fmax(maxError, fabs(y[i] - 3.0f));
std::cout << "Max error: " << maxError << std::endl;
// Free memory
cudaFree(x);
cudaFree(y);
return 0;
}
It just get kill by oom-killer and leave swap space alone.
It there is a pure lib malloc
with 100GB workload, you can see that swap space usage is growing.
UM/UVA can use gpu memory + host memory + swap space, just like virtual memory say