
I am compiling a git version of the MXNet framework, which uses cuDNN internally. When MXNet is compiled in debug mode, my example test runs fine and my neural network trains. However, when I switch to release mode, the execution fails a test and I get the following error: `Check failed: e == CUDNN_STATUS_SUCCESS (8 vs. 0) cuDNN: CUDNN_STATUS_EXECUTION_FAILED`.

Note: I don't see any release/debug-specific code that could explain the different behaviour. And I didn't have any problems at all with either the release or the debug version until I activated cuDNN, so I trust it is the culprit.

The symptoms:
  • The code doesn't necessarily crash at the same location, but it is always during a `CUDNN_CALL` (a macro that calls a cuDNN function and checks the returned status).
  • No memory is allocated on my GPU, which in any case has enough memory for such a network, so that shouldn't be the problem.
  • It happens only in release mode; in debug mode it runs just fine.

Here is an example of where I get the error:

CUDNN_CALL(cudnnAddTensor(s->dnn_handle_,
                                &alpha,
                                bias_desc_,
                                bias.dptr_ + bias_offset_ * g,
                                &beta_add,
                                out_desc_,
                                out_ptr + out_offset_ * g));

So, what could be the causes of such a problem?

Emile D.
  • Can you share your method for building from source? Is it exactly as outlined in the docs [found here](http://mxnet.incubator.apache.org/versions/master/install/ubuntu_setup.html)? – Thom Lane Apr 04 '19 at 23:37
  • Hi Thom. Thank you for your time. Unfortunately, the build I use is a bit more complex and comes from an older version of MXNet (1.2.0). Besides, it is a C++ build, so I don't think it would be a good idea to detail it here. However, my question relates more to what could trigger such specific behaviour of cuDNN, because I don't think it is a problem with MXNet per se. – Emile D. Apr 04 '19 at 23:42
  • Could you try setting the environment variable `MXNET_ENGINE_TYPE=NaiveEngine` to see if you still get the issue? It could be a difference between debug and production, and might give you a slightly more useful error! – Thom Lane Apr 05 '19 at 00:11
  • I'm confused now. It seems that you are right, this environment variable seems to solve the problem. Shall I conclude that it is a problem on MXNet side? But what could be the impact of such variable that would explain the disappearance of the problem? – Emile D. Apr 05 '19 at 14:26
  • `NaiveEngine` is typically slower because it doesn't perform operators in parallel. And I think this explains why things were working in debug mode too. Sounds like you've got an issue when you're queuing operations in `ThreadedEnginePerDevice` mode. Are you sure you're not queuing too many operations and running out of memory? – Thom Lane Apr 05 '19 at 18:39
  • Well, I don't see anything loading in the memory. And it is a FCN, smaller than AlexNet, which is being loaded onto an RTX2060. So I am quite confident it is not the issue. Is there a way to control the number of threads? Could that be a concurrency problem? – Emile D. Apr 06 '19 at 16:22

1 Answer


For some reason, updating cuDNN to version 7.4 did the trick for me. So I guess it really was a problem with cuDNN on my side. I can only hypothesize that a bug fix solved my problem, or that I was using a version that was not fully compatible with my GPU, etc.

Emile D.
  • For some reason, I don't have a better explanation than my answer, which is very sad. Please feel free to edit it and add details and information if you have some. – Emile D. Apr 11 '19 at 15:48