
I write CUDA code that I call from MATLAB MEX files. I am not using any of MATLAB's GPU libraries or capabilities. My code is just CUDA code that accepts C-type variables; I only use MEX to convert from mwtypes to C types and then call independent, self-written CUDA code.

The problem is that sometimes, especially in the development phase, CUDA fails (because I made a mistake). Most CUDA calls are wrapped in a call to gpuErrchk(cudaDoSomething(cuda)), defined as:

// Uses MATLAB functions but you get the idea.
#include <cuda_runtime.h>
#include "mex.h"

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }

// Print the CUDA error with its source location, then raise a MATLAB error.
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
    if (code != cudaSuccess)
    {
        mexPrintf("GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort){
            //cudaDeviceReset(); //This does not make MATLAB release it
            mexErrMsgIdAndTxt("MEX:myfun", ".");
        }
    }
}

This works as expected and reports errors such as:

GPUassert: an illegal memory access was encountered somefile.cu 208

However, in most cases MATLAB does not release the GPU afterwards: even if I change the code and recompile, the next call to the code results in the error:

GPUassert: all CUDA-capable devices are busy or unavailable somefile.cu first_cuda_line

The only way to clear this error is to restart MATLAB. This is annoying and hinders the development/testing process. It does not happen when I develop in, say, Visual Studio.

I have tried calling cudaDeviceReset() both before and after the error has been raised, but to no avail.

What can I do/try to make MATLAB release the GPU after a GPU runtime error?

Dev-iL
Ander Biguri
  • How are you creating the CUDA context your code is using? – talonmies Jul 23 '18 at 11:25
  • If none of MATLAB's GPU libraries are in use, then it is not MATLAB that needs to release the GPU. – Ben Voigt Jul 23 '18 at 11:31
  • @talonmies "How" as in you want the code, or a general description? In short, I have a .cu file with a "main". It only accepts C types and outputs C types. Everything CUDA (memcpy, free, kernels, etc.) happens in this file. MATLAB has, in theory, no knowledge of it using CUDA (except at compilation, where I link the `nvcc` compiler to MVS in mex). The only MATLAB code in this file is the one shown, to raise the error and print. – Ander Biguri Jul 23 '18 at 11:32
  • No, "how" as in: are you using lazy context establishment, or are you doing explicit context management via the runtime API, or some combination of the two? – talonmies Jul 23 '18 at 11:33
  • @BenVoigt indeed, that is why I tried some random stabs in the dark such as the `cudaDeviceReset()`. The fact is, however, that when I take the file, put it in an MVS project, create mock C input data and run it, this does not happen. – Ander Biguri Jul 23 '18 at 11:34
  • Your question really is about cleaning up CUDA resources after an error, with the fact that you are writing a DLL somewhat relevant, and the fact the DLL is called from MATLAB is really beside the point and not worth a tag. – Ben Voigt Jul 23 '18 at 11:34
  • @BenVoigt let's see that after we find a solution ;) – Ander Biguri Jul 23 '18 at 11:34
  • @talonmies lazy context management, if I understand correctly. I don't select the GPU, nor do I reset it (I do free the memory though). – Ander Biguri Jul 23 '18 at 11:35
  • For sure it is important whether the host application exits, ending the process, or keeps running. Which host application, isn't a critical detail. – Ben Voigt Jul 23 '18 at 11:35
  • @BenVoigt I had this exact conversation 4 days ago in another post. Mods decided to leave the tag as it *may* be relevant. Unless you do know the solution, the fact that it is MATLAB may be the source (as I can only reproduce it there) – Ander Biguri Jul 23 '18 at 11:36
  • Fair enough regarding the tag, but I strongly advise that when researching the problem, you also do some searches about error recovery when using CUDA in a DLL -- without MATLAB as a search keyword -- because you'll almost certainly find more information that way. – Ben Voigt Jul 23 '18 at 11:39
  • @BenVoigt thanks, I did, but I have not found anything. Feel free to link something if I missed it. – Ander Biguri Jul 23 '18 at 11:39
  • [This link](https://docs.nvidia.com/cuda/cuda-runtime-api/driver-vs-runtime-api.html) supports @talonmies' suggestion that implicit context management might be part of your problem. Particularly the last paragraph about writing plugins and `cudaDeviceReset` affecting future CUDA calls. – Ben Voigt Jul 23 '18 at 11:44
  • I am going to guess you need to parse the documentation of `mexErrMsgIdAndTxt` very carefully. It might well be that it does not trigger the necessary lazy context teardown that happens when the equivalent of `std::exit` is called. You might either need something else from MATLAB to explicitly unload the MEX file from memory, or perform explicit context management yourself. I am going to disagree with @BenVoigt here and suggest this is some specific MATLAB behaviour – talonmies Jul 23 '18 at 11:46
  • @talonmies: No one expects `mexErrMsgIdAndTxt` to cause context teardown. It just saves an error object which the MATLAB interpreter will look at when it regains control. It doesn't end the process like `std::exit`. The point is how to recover from errors without a process restart... and I suspect the answer is the same for any DLL called from a long-running host, no matter which application is the host. `clear mex` would unload the MEX DLL, which might unload its dependent DLLs, but I'm sure Ander already does `clear mex` when recompiling. – Ben Voigt Jul 23 '18 at 11:53
  • (I'm using the term DLL, which is fairly Windows-centric, because of the mentions of Visual Studio. For Linux, think in terms of "shared object" instead) – Ben Voigt Jul 23 '18 at 11:53
  • @BenVoigt Indeed, `clear mex` does not do the job. Thank you and @talonmies for the suggestions on the context. I will try experimenting that way to see if I can fix it with explicit context management. – Ander Biguri Jul 23 '18 at 12:21
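
A minimal sketch of the direction the comments point to: make the teardown explicit instead of relying on the lazily created context being destroyed at process exit. It assumes the runtime API only, selects device 0 explicitly, and registers a cleanup hook with `mexAtExit` so the context is also destroyed when the MEX file is cleared or MATLAB exits. The names `releaseGpu`, `gpuAssertAndRelease` and `gpuErrchkRelease`, the device index and the buffer size are illustrative assumptions. Nothing in this thread confirms that this clears the "busy or unavailable" state (the question reports that `cudaDeviceReset()` alone did not), so treat it as a starting point for experiments rather than a fix.

```cuda
#include <cuda_runtime.h>
#include "mex.h"

// Hypothetical cleanup hook: destroys the primary CUDA context of this process.
// MATLAB calls it when the MEX file is cleared or when MATLAB terminates.
static void releaseGpu(void)
{
    cudaDeviceReset();
}

// Variant of gpuAssert that tears the context down BEFORE raising the MATLAB
// error, because mexErrMsgIdAndTxt does not return to the caller.
static void gpuAssertAndRelease(cudaError_t code, const char *file, int line)
{
    if (code == cudaSuccess)
        return;
    // The error string is a static string owned by the runtime.
    const char *msg = cudaGetErrorString(code);
    cudaDeviceReset();
    mexErrMsgIdAndTxt("MEX:myfun", "GPUassert: %s %s %d", msg, file, line);
}

#define gpuErrchkRelease(ans) \
    do { gpuAssertAndRelease((ans), __FILE__, __LINE__); } while (0)

void mexFunction(int nlhs, mxArray *plhs[], int nrhs, const mxArray *prhs[])
{
    // Register the cleanup hook once per load of the MEX file.
    static bool registered = false;
    if (!registered) {
        mexAtExit(releaseGpu);
        registered = true;
    }

    // Explicit device selection (assumption: device 0 is the one to use).
    gpuErrchkRelease(cudaSetDevice(0));

    // Placeholder workload: allocate, clear and free a small device buffer.
    float *d_buf = NULL;
    const size_t nbytes = 1024 * sizeof(float);
    gpuErrchkRelease(cudaMalloc((void **)&d_buf, nbytes));
    gpuErrchkRelease(cudaMemset(d_buf, 0, nbytes));
    gpuErrchkRelease(cudaDeviceSynchronize());
    gpuErrchkRelease(cudaFree(d_buf));
}
```

With this layout, running `clear mex` before recompiling triggers `releaseGpu`, which makes it possible to check whether an explicit reset at unload time changes the behaviour described in the question.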

0 Answers