
In CUDA with Unified Virtual Addressing (UVA), memory copies between the CPU and GPU (and vice versa) are scheduled internally. However, the `nvprof` CUDA profiler does not report the PCI-e bus transactions for UVA. Is there any way to see the data transfers that are taking place from host to device and device to host?

  • 1
    By "PCI-e Transactions", do you mean `cudaMemcpy` operations? `nvprof` certainly does report those even with UVA enabled. [Here's](http://pastebin.com/H8GK3sQj) an example. Note that [UVA](http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__UNIFIED.html#group__CUDART__UNIFIED) (Unified Virtual Addressing) is not the same thing as [UM](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd) (Unified Memory). Do you mean UM ? – Robert Crovella Jul 14 '14 at 13:46

1 Answer


Yes, it's possible to get `nvprof` to report on unified memory activity. You may wish to study the available options using

nvprof --help

If you combine the `--print-gpu-trace` and `--unified-memory-profiling per-process-device` options, you should get some results indicating the UM activity.

Here is an example:

$ cat t476.cu
#include <stdio.h>
#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

__global__ void mykernel(int *d_data){

  printf("Data = %d\n", *d_data);
  *d_data = 0;
}

int main(){

  cudaDeviceProp myprop;
  int mydevice;
  int numdevices;
  cudaGetDeviceCount(&numdevices);
  cudaCheckErrors("get dev count fail");
  for (mydevice = 0; mydevice < numdevices; mydevice++){
    cudaGetDeviceProperties(&myprop, mydevice);
    printf("device %d: %s\n", mydevice, myprop.name);
    printf("device %d supports unified addressing: ", mydevice);
    if (myprop.unifiedAddressing) printf(" yes\n");
    else printf("  no\n");
    printf("device %d supports managed memory: ", mydevice);
    if (myprop.managedMemory) printf(" yes\n");
    else printf("  no\n");
    }
  cudaSetDevice(--mydevice);
  printf("using device %d\n", mydevice);
  int h_data = 1;
  int *d_data;
  cudaMalloc(&d_data, sizeof(int));
  cudaMemcpy(d_data, &h_data, sizeof(int), cudaMemcpyHostToDevice);
  mykernel<<<1,1>>>(d_data);
  cudaMemcpy(&h_data, d_data, sizeof(int), cudaMemcpyDeviceToHost);
  printf("data = %d\n", h_data);
  printf("now testing managed memory\n");
  int *m_data;
  cudaMallocManaged(&m_data, sizeof(int));
  cudaCheckErrors("managed mem fail");
  *m_data = 1;
  mykernel<<<1,1>>>(m_data);
  cudaDeviceSynchronize();
  printf("data = %d\n", *m_data);
  cudaCheckErrors("some error");
  return 0;
}
$ nvcc -arch=sm_35 -o t476 t476.cu                                                                             
$ nvprof --print-gpu-trace --unified-memory-profiling per-process-device ./t476
==5114== NVPROF is profiling process 5114, command: ./t476
device 0: GeForce GT 640
device 0 supports unified addressing:  yes
device 0 supports managed memory:  yes
using device 0
Data = 1
data = 0
now testing managed memory
Data = 1
data = 0
==5114== Profiling application: ./t476
==5114== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream             Unified Memory  Name
1.10622s  1.1200us                    -               -         -         -         -        4B  3.5714MB/s  GeForce GT 640          1         7                          -  [CUDA memcpy HtoD]
1.10687s  64.481us              (1 1 1)         (1 1 1)        32        0B        0B         -           -  GeForce GT 640          1         7                          -  mykernel(int*) [102]
1.10693s  2.3360us                    -               -         -         -         -        4B  1.7123MB/s  GeForce GT 640          1         7                          -  [CUDA memcpy DtoH]
1.12579s         -                    -               -         -         -         -         -           -  GeForce GT 640          -         -                          0  [Unified Memory CPU page faults]
1.12579s         -                    -               -         -         -         -         -           -  GeForce GT 640          -         -                        0 B  [Unified Memory Memcpy DtoH]
1.12579s         -                    -               -         -         -         -         -           -  GeForce GT 640          -         -                        0 B  [Unified Memory Memcpy HtoD]
1.12590s  64.097us              (1 1 1)         (1 1 1)        32        0B        0B         -           -  GeForce GT 640          1         7                          -  mykernel(int*) [108]
1.12603s         -                    -               -         -         -         -         -           -  GeForce GT 640          -         -                     4096 B  [Unified Memory Memcpy DtoH]
1.12603s         -                    -               -         -         -         -         -           -  GeForce GT 640          -         -                     4096 B  [Unified Memory Memcpy HtoD]
1.12603s         -                    -               -         -         -         -         -           -  GeForce GT 640          -         -                          1  [Unified Memory CPU page faults]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$
Robert Crovella
  • Thanks for the reply... I want to know the PCI transfers happening between CPU and GPU in Unified Virtual Addressing (UVA). Are there any data transfers in UVA? nvprof does not report any such transfers. – Shilpa Babalad Jul 15 '14 at 11:25
  • You'll note I asked a question for clarification below your question above. Do you mean transfers associated with zero-copy? UVA transfers is unclear. Perhaps you should provide a short sample program that identifies the transfers you have in mind. – Robert Crovella Jul 15 '14 at 12:37
  • Yes, I want to know about the transfers associated with zero-copy. – Shilpa Babalad Jul 16 '14 at 03:33
  • UVA does not imply zero-copy, nor does zero-copy require UVA. – Robert Crovella Jul 16 '14 at 03:42
  • I don't think you'll be able to monitor zero-copy transactions directly with the profiler. The best you could do is probably gather some memory statistics and make some inferences. See [here](http://stackoverflow.com/questions/13871449/cuda-zero-copy-performance) – Robert Crovella Jul 17 '14 at 21:20
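To make the zero-copy pattern discussed in the comments concrete, here is a minimal sketch (not part of the original answer): host memory is allocated pinned and mapped, and the kernel dereferences it directly, so each access crosses the PCI-e bus rather than appearing as a discrete `cudaMemcpy` in the profiler. It assumes a device reporting `canMapHostMemory`; error checking is kept brief for clarity.

```cuda
#include <stdio.h>

__global__ void increment(int *data){
  *data += 1;   // this dereference touches host memory across the bus
}

int main(){
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  if (!prop.canMapHostMemory){
    printf("device 0 cannot map host memory\n");
    return 1;
  }
  int *h_data, *d_ptr;
  // pinned, mapped host allocation
  cudaHostAlloc(&h_data, sizeof(int), cudaHostAllocMapped);
  *h_data = 41;
  // device-side alias of the same allocation
  // (with UVA the pointers are numerically identical)
  cudaHostGetDevicePointer(&d_ptr, h_data, 0);
  increment<<<1,1>>>(d_ptr);
  cudaDeviceSynchronize();
  printf("h_data = %d\n", *h_data);
  cudaFreeHost(h_data);
  return 0;
}
```

Profiling this with `nvprof --print-gpu-trace` shows the kernel launch but no memcpy rows, which is the behavior the comments above describe: zero-copy traffic is not reported as discrete transfers.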