With Unified Virtual Addressing (UVA) in CUDA, the memory copies between CPU and GPU (and vice versa) are scheduled internally. However, the nvprof CUDA profiler does not report the PCI-e bus transactions for UVA. Is there any way to see the data transfers taking place from host to device and from device to host?
- By "PCI-e transactions", do you mean `cudaMemcpy` operations? `nvprof` certainly does report those even with UVA enabled. [Here's](http://pastebin.com/H8GK3sQj) an example. Note that [UVA](http://docs.nvidia.com/cuda/cuda-runtime-api/group__CUDART__UNIFIED.html#group__CUDART__UNIFIED) (Unified Virtual Addressing) is not the same thing as [UM](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-unified-memory-programming-hd) (Unified Memory). Do you mean UM? – Robert Crovella Jul 14 '14 at 13:46
1 Answer
Yes, it's possible to get `nvprof` to report on Unified Memory activities. You may wish to study the options that are available using `nvprof --help`. If you combine the `--print-gpu-trace` and `--unified-memory-profiling per-process-device` options, you should get some results indicating the UM activity.
Here is an example:
$ cat t476.cu
#include <stdio.h>

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

__global__ void mykernel(int *d_data){
  printf("Data = %d\n", *d_data);
  *d_data = 0;
}

int main(){
  cudaDeviceProp myprop;
  int mydevice;
  int numdevices;
  cudaGetDeviceCount(&numdevices);
  cudaCheckErrors("get dev count fail");
  for (mydevice = 0; mydevice < numdevices; mydevice++){
    cudaGetDeviceProperties(&myprop, mydevice);
    printf("device %d: %s\n", mydevice, myprop.name);
    printf("device %d supports unified addressing: ", mydevice);
    if (myprop.unifiedAddressing) printf(" yes\n");
    else printf(" no\n");
    printf("device %d supports managed memory: ", mydevice);
    if (myprop.managedMemory) printf(" yes\n");
    else printf(" no\n");
  }
  cudaSetDevice(--mydevice);   // use the last enumerated device
  printf("using device %d\n", mydevice);
  // ordinary device allocation with explicit copies
  int h_data = 1;
  int *d_data;
  cudaMalloc(&d_data, sizeof(int));
  cudaMemcpy(d_data, &h_data, sizeof(int), cudaMemcpyHostToDevice);
  mykernel<<<1,1>>>(d_data);
  cudaMemcpy(&h_data, d_data, sizeof(int), cudaMemcpyDeviceToHost);
  printf("data = %d\n", h_data);
  // managed (unified memory) allocation - no explicit copies needed
  printf("now testing managed memory\n");
  int *m_data;
  cudaMallocManaged(&m_data, sizeof(int));
  cudaCheckErrors("managed mem fail");
  *m_data = 1;
  mykernel<<<1,1>>>(m_data);
  cudaDeviceSynchronize();
  printf("data = %d\n", *m_data);
  cudaCheckErrors("some error");
  return 0;
}
$ nvcc -arch=sm_35 -o t476 t476.cu
$ nvprof --print-gpu-trace --unified-memory-profiling per-process-device ./t476
==5114== NVPROF is profiling process 5114, command: ./t476
device 0: GeForce GT 640
device 0 supports unified addressing: yes
device 0 supports managed memory: yes
using device 0
Data = 1
data = 0
now testing managed memory
Data = 1
data = 0
==5114== Profiling application: ./t476
==5114== Profiling result:
Start Duration Grid Size Block Size Regs* SSMem* DSMem* Size Throughput Device Context Stream Unified Memory Name
1.10622s 1.1200us - - - - - 4B 3.5714MB/s GeForce GT 640 1 7 - [CUDA memcpy HtoD]
1.10687s 64.481us (1 1 1) (1 1 1) 32 0B 0B - - GeForce GT 640 1 7 - mykernel(int*) [102]
1.10693s 2.3360us - - - - - 4B 1.7123MB/s GeForce GT 640 1 7 - [CUDA memcpy DtoH]
1.12579s - - - - - - - - GeForce GT 640 - - 0 [Unified Memory CPU page faults]
1.12579s - - - - - - - - GeForce GT 640 - - 0 B [Unified Memory Memcpy DtoH]
1.12579s - - - - - - - - GeForce GT 640 - - 0 B [Unified Memory Memcpy HtoD]
1.12590s 64.097us (1 1 1) (1 1 1) 32 0B 0B - - GeForce GT 640 1 7 - mykernel(int*) [108]
1.12603s - - - - - - - - GeForce GT 640 - - 4096 B [Unified Memory Memcpy DtoH]
1.12603s - - - - - - - - GeForce GT 640 - - 4096 B [Unified Memory Memcpy HtoD]
1.12603s - - - - - - - - GeForce GT 640 - - 1 [Unified Memory CPU page faults]
Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
$

Robert Crovella
- Thanks for the reply. I want to know the PCI-e transfers happening between CPU and GPU under Unified Virtual Addressing (UVA). Are there any data transfers in UVA? nvprof does not report any such transfers. – Shilpa Babalad Jul 15 '14 at 11:25
- You'll note I asked a question for clarification under your question above. Do you mean transfers associated with zero-copy? "UVA transfers" is unclear. Perhaps you should provide a short sample program that identifies the transfers you have in mind. – Robert Crovella Jul 15 '14 at 12:37
- Yes, I want to know about the transfers associated with zero-copy. – Shilpa Babalad Jul 16 '14 at 03:33
- I don't think you'll be able to monitor zero-copy transactions directly with the profiler. The best you could do is probably gather some memory statistics and make some inferences. See [here](http://stackoverflow.com/questions/13871449/cuda-zero-copy-performance). – Robert Crovella Jul 17 '14 at 21:20
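For what it's worth, here is a minimal sketch (names and sizes are illustrative, not from the thread) of the kind of zero-copy access being discussed: the kernel reads and writes mapped pinned host memory directly over PCI-e, so those accesses never appear as memcpy rows in the nvprof trace.

#include <stdio.h>
#include <cuda_runtime.h>

__global__ void scale(float *p, int n){
  int i = blockIdx.x*blockDim.x + threadIdx.x;
  if (i < n) p[i] *= 2.0f;   // each access goes over PCI-e to host memory
}

int main(){
  const int n = 1 << 20;
  float *h_buf, *d_alias;
  // enable mapped host memory (required on some setups; must precede context creation)
  cudaSetDeviceFlags(cudaDeviceMapHost);
  // mapped (zero-copy) pinned allocation: the GPU accesses this host
  // buffer directly, with no explicit cudaMemcpy
  cudaHostAlloc((void **)&h_buf, n*sizeof(float), cudaHostAllocMapped);
  cudaHostGetDevicePointer((void **)&d_alias, h_buf, 0);
  for (int i = 0; i < n; i++) h_buf[i] = 1.0f;
  scale<<<(n+255)/256, 256>>>(d_alias, n);
  cudaDeviceSynchronize();
  printf("h_buf[0] = %f\n", h_buf[0]);  // expect 2.0
  cudaFreeHost(h_buf);
  return 0;
}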