
Usually one layer in a DNN consists of MatMul, BiasAdd, and ReLU. cuBLAS provides GEMM for the MatMul, and we can do BiasAdd and ReLU in another GPU kernel. That makes two GPU launch calls. Is there any way to fuse them all together into just one? I looked into cuBLAS and cuDNN but found nothing. I don't think it should be difficult, because BiasAdd and ReLU are just element-wise operations, and fusing them would make this more efficient.
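To make the two launches concrete, something like this minimal sketch is what I have in mind (assumptions, not from a real code base: float32, column-major layout as cuBLAS expects, W is M x K, x is K x N, bias has length M; dense_layer is just a hypothetical wrapper, error checking omitted):

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    // Launch 2 of 2: element-wise BiasAdd + ReLU over the GEMM output
    __global__ void bias_relu(float* out, const float* bias, int m, int n) {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < m * n) {
            float v = out[idx] + bias[idx % m];  // BiasAdd (bias per output unit)
            out[idx] = v > 0.f ? v : 0.f;        // ReLU
        }
    }

    void dense_layer(cublasHandle_t handle, const float* W, const float* x,
                     const float* bias, float* out, int M, int N, int K) {
        const float alpha = 1.f, beta = 0.f;
        // Launch 1 of 2: out = W * x via cuBLAS
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    M, N, K, &alpha, W, M, x, K, &beta, out, M);
        // Launch 2 of 2: BiasAdd + ReLU
        int threads = 256;
        int blocks = (M * N + threads - 1) / threads;
        bias_relu<<<blocks, threads>>>(out, bias, M, N);
    }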

Here is the background:

I am working on an online prediction service that ensembles multiple DNN models. By profiling my program, I found that neither my CPU nor my GPU is fully utilized, and requests block on GPU-related function calls (like launchKernel). It seems like there is one big lock in libcuda. I am using TensorFlow with XLA enabled, so I used nvprof and TensorFlow's HLO to visualize the GPU calls, and there are only dot and fused (which is BiasAdd and ReLU) operations. So although kernel fusion is done, there are still too many launchKernel calls, and GPU utilization is only 60%. I tried multiple CUDA contexts in one process, but the improvement was trivial.
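To illustrate the situation: even when each request is issued from its own worker thread onto its own non-blocking stream in a single context (a minimal sketch below; names and sizes are hypothetical, memory management and error checking are omitted), every launch still appears to queue up behind the same lock in libcuda:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    void serve_request(const float* W, const float* x, float* out,
                       int M, int N, int K) {
        cudaStream_t stream;
        cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking);

        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSetStream(handle, stream);   // enqueue this request's GEMM on its own stream

        const float alpha = 1.f, beta = 0.f;
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    M, N, K, &alpha, W, M, x, K, &beta, out, M);
        // ... BiasAdd + ReLU kernel would be launched on the same stream here ...

        cudaStreamSynchronize(stream);     // launches above still go through the
                                           // same driver call path for the process
        cublasDestroy(handle);
        cudaStreamDestroy(stream);
    }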

By the way, I am using a single GPU, a Tesla P100.

  • I don't see how to do this without writing your own custom kernel which combines the GEMM with whatever element-wise operations you need to perform. And writing a performant GEMM kernel is a non-trivial exercise. – talonmies Oct 25 '17 at 10:59
  • libcuda is really a black box, so I can't tell what I should do with it. Kernel fusion might be an option, I think, but I won't try it; it's too complicated to hand-write in CUDA XD. Maybe there are other solutions. – Kan Liu Oct 25 '17 at 12:09
  • Sure, there are lots of ways to optimize CUDA with things like streams, concurrent execution, async memcopies, batched matrix multiplies, etc. But you're not in control of the source code, so what do you expect to do? – Aleksandr Dubinsky Nov 10 '17 at 13:24
  • Things like streams, concurrent execution, async memcopies, and batched matrix multiplies are already in use. I don't know what to do next to avoid the CUDA-driver-level mutex contention; I don't even know why there is a mutex there, maybe a lock-free queue would be better. Customizing a GEMM kernel based on an existing high-performance version may be a reasonable choice (a naive sketch of such a fused kernel follows these comments). – Kan Liu Nov 12 '17 at 17:03
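For illustration, a naive sketch of a single fused MatMul + BiasAdd + ReLU kernel might look like the following (one thread per output element, no tiling or shared memory, so it is nowhere near cuBLAS performance; a usable version would need the tiling, shared-memory staging, and register blocking that cuBLAS's own kernels use). Column-major float32 layout is assumed as in the sketch above:

    // W is M x K, x is K x N, bias has length M, out is M x N (all column-major).
    __global__ void fused_gemm_bias_relu(const float* __restrict__ W,
                                         const float* __restrict__ x,
                                         const float* __restrict__ bias,
                                         float* __restrict__ out,
                                         int M, int N, int K) {
        int row = blockIdx.y * blockDim.y + threadIdx.y;  // output unit
        int col = blockIdx.x * blockDim.x + threadIdx.x;  // batch element
        if (row < M && col < N) {
            float acc = bias[row];                        // fused BiasAdd
            for (int k = 0; k < K; ++k)
                acc += W[k * M + row] * x[col * K + k];   // MatMul dot product
            out[col * M + row] = acc > 0.f ? acc : 0.f;   // fused ReLU
        }
    }

    // Example launch configuration (16 x 16 thread blocks):
    // dim3 block(16, 16);
    // dim3 grid((N + 15) / 16, (M + 15) / 16);
    // fused_gemm_bias_relu<<<grid, block>>>(W, x, bias, out, M, N, K);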

0 Answers