My program works fine on my standard Ubuntu x64 box, but if I run it under `valgrind` I see the following error:

```
==22246== Conditional jump or move depends on uninitialised value(s)
==22246==    at 0x9854DCBC: ??? (in /usr/lib/x86_64-linux-gnu/libcudnn.so.6.0.21)
==22246==    by 0x98182940: cudnnGetConvolutionBackwardFilterWorkspaceSize (in /usr/lib/x86_64-linux-gnu/libcudnn.so.6.0.21)
==22246==    by 0x982C3787: ??? (in /usr/lib/x86_64-linux-gnu/libcudnn.so.6.0.21)
==22246==    by 0x9817FC0F: cudnnGetConvolutionBackwardFilterAlgorithm (in /usr/lib/x86_64-linux-gnu/libcudnn.so.6.0.21)
==22246==    by 0x903F4908: caffe::CuDNNConvolutionLayer<float>::Reshape(std::vector<caffe::Blob<float>*, std::allocator<caffe::Blob<float>*> > const&, std::vector<caffe::Blob<float>*, std::allocator<caffe::Blob<float>*> > const&) (cudnn_conv_layer.cpp:149)
==22246==    by 0x904F35D8: SetUp (layer.hpp:72)
==22246==    by 0x904F35D8: caffe::Net<float>::Init(caffe::NetParameter const&) (net.cpp:148)
==22246==    by 0x904F523F: caffe::Net<float>::Net(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, caffe::Phase, int, std::vector<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::allocator<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > > const*, caffe::Net<float> const*) (net.cpp:45)
```
-- from here down, the stack is in my own code.

Unfortunately, although my code is a pretty bog-standard interface to Caffe, the model is a complex proprietary one that I cannot share here. Furthermore, cuDNN is closed-source, so I cannot debug it to see whether this is a problem worth bothering about.
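For what it's worth, the calling code follows the usual Caffe C++ deployment pattern, roughly like the sketch below (the file names are placeholders, since the prototxt and weights are the proprietary parts):

```cpp
#include <caffe/caffe.hpp>

#include <string>
#include <vector>

int main() {
  // Placeholder paths -- the real model definition and weights are proprietary.
  const std::string model_def = "deploy.prototxt";
  const std::string weights   = "model.caffemodel";

  // GPU mode, so the convolution layers take the cuDNN path seen in the trace.
  caffe::Caffe::set_mode(caffe::Caffe::GPU);

  // Net construction is where the valgrind report fires:
  // Net::Init -> Layer::SetUp -> CuDNNConvolutionLayer::Reshape
  //   -> cudnnGetConvolutionBackwardFilterAlgorithm / ...WorkspaceSize
  caffe::Net<float> net(model_def, caffe::TEST);
  net.CopyTrainedLayersFrom(weights);

  const std::vector<caffe::Blob<float>*>& output = net.Forward();
  (void)output;  // results are consumed by the rest of the application
  return 0;
}
```

As the trace shows, the report fires inside the `caffe::Net` constructor, i.e. before any data has gone through the network.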

Googling `cudnnGetConvolutionBackwardFilterWorkspaceSize valgrind` and `cudnnGetConvolutionBackwardFilterAlgorithm valgrind` turns up nothing useful except a hint to add `--track-origins=yes`, but when I add that flag the error goes away...
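In the meantime I am tempted to simply suppress the report, since every frame above my own code is inside libcudnn. A suppression roughly like the one below ought to match it, although I haven't verified the exact frame patterns; `valgrind --gen-suppressions=all` prints ready-made entries that can be pasted into a file and loaded with `--suppressions=` (here `my_app` and `cudnn.supp` are placeholder names):

```
# Generate candidate entries:  valgrind --gen-suppressions=all ./my_app
# Use the suppression file:    valgrind --suppressions=cudnn.supp ./my_app
{
   cudnn-backward-filter-workspace-uninit
   Memcheck:Cond
   obj:*/libcudnn.so.6*
   fun:cudnnGetConvolutionBackwardFilterWorkspaceSize
}
```

That would only hide the warning rather than explain it, of course, so it does nothing for the crash on the target device described below.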

The problem I am actually trying to solve is that the Deep Learning module crashes inside the standard library on the target platform by freeing already-freed memory (a double free). However, the target is an ARM-based device that I cannot get access to for further investigation.

Ken Y-N
  • Without source code I'm not sure what Stack Overflow can really provide here. I would encourage you to try and produce a [mcve]. You'll need this if you have to produce a bug report for your libraries anyway. – Jonathon Reinhart Aug 21 '18 at 02:45
  • Yes, I realised that I would probably need a [MCVE], but I was hoping the function names might ring a bell with someone. Our model file also uses deprecated input fields, so I suppose the first step would be to change to input layers and see if that shuts `valgrind` up. – Ken Y-N Aug 21 '18 at 03:32
  • Can you run in CPU mode (if you're using the Python interface, that's `caffe.set_mode_cpu`; it can also be specified in the prototxt)? If so, and the results are roughly similar (they should be for standard layers), then debugging under valgrind might be feasible. I'd caution that running in valgrind will make things *very* slow, and that it isn't entirely surprising to me that it doesn't work properly on heavily optimized GPU code. – en_Knight Aug 28 '18 at 21:16
  • Also, could you run under gdb instead of valgrind? It won't show you where memory errors are happening, but it might at least give you a stack trace, which may or may not be useful. Otherwise, the usual suspect for this kind of crash is a dimension mismatch somewhere, like claiming to have 10 softmax classes but passing in a label that is class 12, etc. – en_Knight Aug 28 '18 at 21:17

0 Answers