#include <cuda_runtime.h>

#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <random>
#include <string>

using namespace std;
// Simple stopwatch built on std::chrono.
// Uses steady_clock (monotonic) instead of system_clock: system_clock can
// jump when the OS adjusts the wall clock (NTP, DST), which would corrupt
// or even negate measured intervals. steady_clock is the correct clock for
// measuring elapsed time.
class MyTimer {
std::chrono::time_point<std::chrono::steady_clock> start;
public:
// Record the current instant as the reference point for all getCounter* calls.
void startCounter() {
start = std::chrono::steady_clock::now();
}
// Nanoseconds elapsed since the last startCounter().
int64_t getCounterNs() {
return std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - start).count();
}
// Whole milliseconds elapsed since the last startCounter().
int64_t getCounterMs() {
return std::chrono::duration_cast<std::chrono::milliseconds>(std::chrono::steady_clock::now() - start).count();
}
// Milliseconds elapsed since the last startCounter(), with sub-millisecond
// precision (nanosecond count divided down to ms).
double getCounterMsPrecise() {
return std::chrono::duration_cast<std::chrono::nanoseconds>(std::chrono::steady_clock::now() - start).count()
/ 1000000.0;
}
};
// Trivial device-printf kernel; definition at the bottom of the file.
__global__ void HelloWorld();
// Fills a[0..N) with random non-negative floats on the host; definition below.
void GenData(int N, float* a);
// Elementwise product output[i] = x[i] * y[i]; definition below.
__global__ void Multiply(int N, float* __restrict__ output, const float* __restrict__ x, const float* __restrict__ y);
//----------
// volatile sink: assigning timing/summation results here keeps the compiler
// from optimizing the measured work away.
volatile double dummy = 0;
// Check every CUDA runtime call; an earlier sticky error would otherwise make
// all later calls fail silently and corrupt the benchmark numbers.
#define CUDA_CHECK(call)                                                      \
    do {                                                                      \
        cudaError_t err_ = (call);                                            \
        if (err_ != cudaSuccess) {                                            \
            fprintf(stderr, "CUDA error %s:%d: %s\n", __FILE__, __LINE__,     \
                    cudaGetErrorString(err_));                                \
            std::abort();                                                     \
        }                                                                     \
    } while (0)

// Benchmark: times the Multiply kernel over nloop iterations on unified-memory
// buffers. When `summing` is true, d_res is migrated back to the CPU each
// iteration (via prefetch, or the commented-out host read), demonstrating that
// a host-side access invalidates the GPU's page mappings and slows the next
// kernel launch. Prints the average per-iteration kernel time in ms.
void Test(bool summing)
{
MyTimer timer;
int N = 10000000;
float *d_x, *d_y, *d_res;
CUDA_CHECK(cudaMallocManaged(&d_x, N * sizeof(float)));
CUDA_CHECK(cudaMallocManaged(&d_y, N * sizeof(float)));
CUDA_CHECK(cudaMallocManaged(&d_res, N * sizeof(float)));
// Hint: let the CPU access d_res directly instead of migrating its pages.
// (Only honored on systems with directManagedMemAccessFromHost support.)
CUDA_CHECK(cudaMemAdvise(d_res, N * sizeof(float), cudaMemAdviseSetAccessedBy, cudaCpuDeviceId));
GenData(N, d_x);
GenData(N, d_y);
GenData(N, d_res);
// Move all three buffers onto device 0 up front so the first timed iteration
// does not pay first-touch page-fault cost.
CUDA_CHECK(cudaMemPrefetchAsync(d_x, N * sizeof(float), 0, 0));
CUDA_CHECK(cudaMemPrefetchAsync(d_y, N * sizeof(float), 0, 0));
CUDA_CHECK(cudaMemPrefetchAsync(d_res, N * sizeof(float), 0, 0));
CUDA_CHECK(cudaDeviceSynchronize());
//-------------------
int nloop = 100;
double cost = 0;
for (int t = 1; t <= nloop; t++) {
timer.startCounter();
Multiply<<<256,256>>>(N, d_res, d_x, d_y);
CUDA_CHECK(cudaGetLastError());  // launch-config errors surface here, not from the launch itself
CUDA_CHECK(cudaDeviceSynchronize());
dummy = timer.getCounterMsPrecise();
cost += dummy;
// This only reads data, and doesn't write.
// Why does it still invalidate d_res memory pages on the GPU?
// Is there any way to read data from d_res without making the next kernel call slower?
float sum = rand() % 1000;
if (summing) {
// either line below will make the next kernel slower
CUDA_CHECK(cudaMemPrefetchAsync(d_res, N * sizeof(float), cudaCpuDeviceId, 0));
//for (int i = 0; i < N; i++) sum += d_res[i];
}
CUDA_CHECK(cudaDeviceSynchronize());
dummy = sum;
}
cout << "Summing = " << summing << " cost = " << cost / nloop << "\n";
CUDA_CHECK(cudaFree(d_x));
CUDA_CHECK(cudaFree(d_y));
CUDA_CHECK(cudaFree(d_res));
}
// Entry point: seeds rand(), launches a trivial kernel to verify the device is
// usable, then runs the benchmark twice in each mode (interleaved) so warm-up
// effects can be ruled out.
int main()
{
srand(time(NULL));
HelloWorld<<<1,1>>>();
// Kernel launches don't return a status; fetch launch errors explicitly so a
// bad configuration or missing device fails loudly here, not mysteriously later.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess) {
std::cerr << "HelloWorld launch failed: " << cudaGetErrorString(err) << "\n";
return 1;
}
cudaDeviceSynchronize();  // also flushes the device-side printf
Test(false);
Test(true);
Test(false);
Test(true);
return 0;
}
//-----------------------------
//-----------------------------
//-----------------------------
// Trivial kernel: prints a greeting from the device. Device printf output is
// buffered and only appears at the next synchronizing call on the host.
__global__
void HelloWorld()
{
printf("Hello world\n");
}
// Fills a[0..N) with random non-negative floats in roughly [0, 999] on the
// host (CPU first-touch: with unified memory this also places the pages on
// the CPU until they are prefetched or faulted to the device).
//
// Fix: the original computed `float(rand() % 1000) / (rand() % 1000 + 1)` in
// one expression; the two rand() calls are unsequenced relative to each other,
// so the generated sequence was compiler-dependent. Draw numerator and
// denominator as separate statements for a well-defined, reproducible order.
void GenData(int N, float* a)
{
for (int i = 0; i < N; i++) {
int num = rand() % 1000;      // numerator drawn first, deterministically
int den = rand() % 1000 + 1;  // +1 guarantees no division by zero
a[i] = float(num) / float(den);
}
}
// Elementwise product: output[i] = x[i] * y[i] for every i in [0, N).
// Grid-stride loop, so any <<<grid, block>>> configuration covers the full
// range. __restrict__ on all pointers lets the compiler assume no aliasing.
__global__
void Multiply(int N, float* __restrict__ output, const float* __restrict__ x, const float* __restrict__ y)
{
const int step = blockDim.x * gridDim.x;
int idx = blockIdx.x * blockDim.x + threadIdx.x;
while (idx < N) {
output[idx] = x[idx] * y[idx];
idx += step;
}
}
Compile command: `nvcc -o main main.cu -O3 -std=c++17`, running on an RTX 2080 Ti.
Output:
Hello world
Summing = 0 cost = 0.237617
Summing = 1 cost = 12.1865
Summing = 0 cost = 0.235626
Summing = 1 cost = 11.8909
I have a bunch of Unified Memory (UM) GPU arrays (similar to Matlab's gpuArray). 99.9% of the computation/write operations are performed in GPU device memory; occasionally the result is read on the CPU.
I notice that merely reading a UM array on the CPU side causes the next kernel call on that array to become much slower. I assume this is due to some kind of page fault, though I thought page faults were usually triggered by writes rather than reads.
Why does this happen? How can I change the code so that reading from UM does not make the next kernel slower? (Edit: ideally without using cudaMemPrefetchAsync.)
Edit: also, the device property directManagedMemAccessFromHost reports 0 on both the 2080 Ti and an Nvidia A30. Which systems actually support that feature?