Reading Shared/Local Memory Store/Load bank conflicts hardware counters for OpenCL executable under Nvidia

Question

It is possible to use nvprof to read the bank conflict counters for a CUDA executable:

nvprof --events shared_st_bank_conflict,shared_ld_bank_conflict my_cuda_exe

However, it does not work for code that uses OpenCL rather than CUDA.

Is there any way to extract these counters outside nvprof from an OpenCL environment, maybe directly from PTX?
Alternatively, is there any way to convert the PTX assembly generated by the NVIDIA OpenCL compiler (obtained using clGetProgramInfo with CL_PROGRAM_BINARIES) to a CUDA kernel and run it using cuModuleLoadDataEx, and thus be able to use nvprof?
Is there any simulation CPU backend that allows setting parameters such as bank size?

Additional option:

Use a converter from OpenCL to CUDA code, including features missing from CUDA such as vloadn/vstoren, float16, and various other accessors. #define macros work only for simple kernels. Is there any tool that provides this?

Can you pass the OpenCL-generated PTX to `cuModuleLoadDataEx`? There's no guarantee that the same `ptxas` compilation from PTX to SASS is the same, but it's a reasonable guess. It's possible options to `ptxas` differ from OpenCL and CUDA (e.g. rounding rules). There's no guarantee that you'd be profiling the same programs, but perhaps it's a good approximation. — Tim, Oct 19 '20 at 23:10

talonmies · Answer 1 · 2020-10-25T03:04:35.670


Is there any way to extract these counters outside nvprof from an OpenCL environment, maybe directly from PTX?

No. Nor is there in CUDA, nor in compute shaders in OpenGL, DirectX or Vulkan.

Alternatively, is there any way to convert the PTX assembly generated by the NVIDIA OpenCL compiler (obtained using clGetProgramInfo with CL_PROGRAM_BINARIES) to a CUDA kernel and run it using cuModuleLoadDataEx, and thus be able to use nvprof?

No. OpenCL PTX and CUDA PTX are not the same and can't be used interchangeably.

Is there any simulation CPU backend that allows setting parameters such as bank size?

Not that I am aware of.
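
For a rough feel of what the counters measure, a toy model can be written by hand. The sketch below is not a simulator backend and does not read any hardware counter. It assumes a simplified rule (bank = word index modulo the number of banks, with threads that touch the same word served by a broadcast), and the defaults of 32 banks that are 4 bytes wide are assumptions to adjust for the target architecture. The function names are hypothetical.

```python
from collections import defaultdict


def bank_conflict_degree(byte_addresses, num_banks=32, bank_width=4):
    """Worst-case serialization factor for one group of simultaneous accesses.

    Simplified model (an assumption, not vendor documentation):
    a word's bank is (address // bank_width) % num_banks, threads hitting
    the same word are broadcast, and distinct words in one bank serialize.
    Returns 1 for a conflict-free access and 0 for an empty access list.
    """
    words_per_bank = defaultdict(set)
    for addr in byte_addresses:
        word = addr // bank_width
        words_per_bank[word % num_banks].add(word)
    return max((len(words) for words in words_per_bank.values()), default=0)


def strided_access(stride_words, num_threads=32, element_size=4):
    """Byte addresses touched when thread t reads element t * stride_words."""
    return [t * stride_words * element_size for t in range(num_threads)]


if __name__ == "__main__":
    for stride in (1, 2, 32, 33):
        degree = bank_conflict_degree(strided_access(stride))
        print(f"stride {stride}: {degree}-way access")
```

Under this model a stride of 32 words puts every thread in the same bank, while padding the stride to 33 words spreads the threads across all banks again. Real hardware may behave differently, so the numbers are only an estimate of what a profiler would report.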

edited Oct 25 '20 at 03:04

answered Oct 25 '20 at 00:35

talonmies


"OpenCL PTX and CUDA PTX are not the same", of course. However, in multiple cases where kernels were compiled for both CUDA (with defines for things like get_global_id) and OpenCL, the PTX outputs were very similar, apart from small differences in the headers. So theoretically a conversion may be possible, but I need to understand how. – Artyom Oct 25 '20 at 07:46

You asked "Is there any way to extract these counters outside nvprof ... maybe directly from ptx", and the answer to that is no: you can't manipulate or access profiling data from user code. There are some programmer-exposed trigger counters you can increment, and you can turn profiling data collection on and off, but that is it. – talonmies Oct 25 '20 at 08:10
Awarding for answering but not accepting since I was looking for any kind of direction/solution and it wasn't provided. – Artyom Oct 27 '20 at 20:40

It is very hard to propose a solution when one doesn't exist. If you want to substitute your own alternative reality for the current state of affairs, you are perfectly entitled to do so. But you need to recognize you are using the vendor's fourth tier compute API in terms of support (after CUDA, the graphics API compute features, and the compiler driven APIs), and you should not expect it to do more than barely work. If you are wedded to OpenCL, use a different vendor. If you are wedded to NVIDIA, use CUDA. That is the harsh reality, whether you like it or not. – talonmies Oct 28 '20 at 02:19
