
I'm trying to use the LD_PRELOAD trick to intercept CUDA driver API calls (the cu* functions). I first implemented a simple stub for cuLaunchKernel, as below:

#define _GNU_SOURCE
#include <cuda.h>
#include <dlfcn.h>
#include <stdio.h>
//#include <cuda_runtime.h>
//#include <driver_types.h>

void cuLaunchKernelHelper (CUstream hStream);


CUresult cuLaunchKernel (CUfunction f, unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ, unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ, unsigned int sharedMemBytes, CUstream hStream, void** kernelParams, void** extra) {

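        CUresult (*function)(CUfunction f,
                        unsigned int gridDimX, unsigned int gridDimY, unsigned int gridDimZ,
                        unsigned int blockDimX, unsigned int blockDimY, unsigned int blockDimZ,
                        unsigned int sharedMemBytes, CUstream hStream, void** kernelParams, void** extra);

        *(void **)(&function) = dlsym (RTLD_NEXT, "cuLaunchKernel");
        if (function == NULL) {
                // No other definition of cuLaunchKernel exists later in the lookup order
                fprintf (stderr, "cuLaunchKernel: real symbol not found\n");
                return CUDA_ERROR_NOT_FOUND;
        }

        cuLaunchKernelHelper (hStream);

        // Forward to the real driver function and propagate its result
        return (*function)(f, gridDimX, gridDimY, gridDimZ, blockDimX, blockDimY, blockDimZ, sharedMemBytes, hStream, kernelParams, extra);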

}

void cuLaunchKernelHelper (CUstream hStream) {
        // Placeholder hook: only logs that the interception fired
        printf ("cuLaunchHelper\n");
}

After that, I run the matrix multiplication example in CUDA samples as below:

LD_PRELOAD="stub.so" ./matrixMul

Unfortunately, nothing happens. I cannot capture this CUDA call. Now I'm wondering, is there anything specific with regard to CUDA binary codes that I need to consider?

talonmies
saman
  • Your method works just fine for me when I run it on a driver API application that actually calls `cuLaunchKernel`, such as `vectorAddDrv` – Robert Crovella Apr 23 '18 at 01:33

1 Answer

Now I'm wondering, is there anything specific with regard to CUDA binary codes that I need to consider?

Yes. The program you apply the LD_PRELOAD trick to must actually call cuLaunchKernel, and it must be dynamically linked against the CUDA driver API library. Runtime API applications compiled with nvcc (the bulk of the CUDA code you will encounter) use compiler-generated boilerplate that resolves to the deprecated cuLaunch API, not cuLaunchKernel. You see nothing because your stub intercepts an API that the application you are testing with never calls, and because that application was not dynamically linked against the driver library.

As pointed out in comments, if you use a driver API application like any of the samples listed here, it should work as expected.

talonmies
  • I found the problem. The CUDA libraries themselves use dlsym to look up the functions they call. It has nothing to do with intercepting the wrong function. – saman Apr 23 '18 at 23:14
  • Both observations are correct here, although the specific underlying behavior has changed over time and may vary by CUDA version. On CUDA 9.1, `libcuda.so` is runtime-loaded by libcudart, and dlsym connections are made to a large set of functions (e.g. `ltrace ./myexe 2>&1 | grep cu`). So those function calls are not accessible via an ordinary LD_PRELOAD intercept. However, the statements in this answer are also correct, to wit: `You need to ensure that the code which you are applying the LD_PRELOAD trick to actually calls cuLaunchKernel.` The code *you write* must call this function for the intercept to work. – Robert Crovella Apr 23 '18 at 23:44
  • The LD_PRELOAD method allows you to intercept calls into a *dynamically linked* library. `libcuda.so` is not dynamically linked (use `ldd` to verify) to the typical CUDA runtime API application (instead, to the extent that it is used, it is *runtime loaded*). It *is* typically dynamically linked to a CUDA driver API application (`cuLaunchKernel` is part of the CUDA driver API). If you build a CUDA runtime application with dynamic linking to libcudart (which is not the default compile option), you should be able to intercept a launch call from the runtime API, such as `cudaLaunch`. – Robert Crovella Apr 23 '18 at 23:55