
I was using Keras + Theano to predict labels with a pre-trained VGG model on an NVIDIA TK1.

I am getting faster prediction times from the CPU than from the GPU. If my memory is correct, prediction involves a lot of repetitive number crunching, which is exactly what GPUs are good at, so I don't understand why the GPU would be slower here.

Does anyone have a good explanation?
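For context, the CPU-vs-GPU comparison can be timed roughly as in the sketch below (a sketch only; the model file name, input shape, and number of runs are assumptions, not the exact benchmark code):

    import time
    import numpy as np
    from keras.models import load_model

    # Hypothetical model file; adjust to the actual pre-trained VGG weights.
    model = load_model('vgg_pretrained.h5')

    # Theano dimension ordering: (batch, channels, height, width).
    x = np.random.rand(1, 3, 224, 224).astype('float32')

    # Average several predictions on whichever device Theano selected.
    n_runs = 10
    start = time.time()
    for _ in range(n_runs):
        model.predict(x)
    print('mean prediction time: %.3f s' % ((time.time() - start) / n_runs))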

The GPU line Theano prints at startup: Using gpu device 0: GK20A (CNMeM is enabled with initial size: 75.0% of memory, cuDNN Version is too old. Update to v5, was 2000.)
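For reference, the device is chosen through Theano flags set before Theano (or Keras) is first imported; a minimal sketch, with flag values assumed to match the startup message above (first CUDA device, 75% CNMeM):

    import os

    # Must be set before the first import of theano/keras.
    # device=gpu selects the first CUDA device (the GK20A on the TK1);
    # lib.cnmem=0.75 pre-allocates 75% of GPU memory, as in the message above.
    # For the CPU run, device=cpu is used instead.
    os.environ['THEANO_FLAGS'] = 'device=gpu,floatX=float32,lib.cnmem=0.75'

    import keras  # imported only after the flags are in place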

Here is the profiling result of the prediction:

Class
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
  39.5%    39.5%       0.019s       6.42e-03s     C        3       3   theano.sandbox.cuda.blas.GpuDot22
  24.8%    64.3%       0.012s       6.04e-03s     C        2       2   theano.sandbox.cuda.blas.GpuCorrMM
  16.4%    80.8%       0.008s       1.33e-03s     C        6       6   theano.sandbox.cuda.basic_ops.GpuElemwise
   7.8%    88.5%       0.004s       1.89e-03s     C        2       2   theano.sandbox.cuda.blas.GpuDownsampleFactorMax
   4.2%    92.7%       0.002s       2.03e-03s     C        1       1   theano.sandbox.rng_mrg.GPU_mrg_uniform
   3.8%    96.4%       0.002s       4.57e-04s     C        4       4   theano.sandbox.cuda.basic_ops.GpuContiguous
   2.3%    98.8%       0.001s       5.66e-04s     C        2       2   theano.sandbox.cuda.basic_ops.GpuFromHost
   0.5%    99.3%       0.000s       2.51e-04s     C        1       1   theano.sandbox.cuda.nnet.GpuSoftmaxWithBias
   0.5%    99.8%       0.000s       2.39e-04s     C        1       1   theano.sandbox.cuda.basic_ops.HostFromGpu
   0.1%    99.8%       0.000s       1.37e-05s     C        3       3   theano.sandbox.cuda.basic_ops.GpuReshape
   0.0%    99.9%       0.000s       9.54e-06s     C        2       2   theano.sandbox.cuda.basic_ops.GpuSubtensor
   0.0%    99.9%       0.000s       4.35e-06s     C        4       4   theano.tensor.elemwise.Elemwise
   0.0%    99.9%       0.000s       5.01e-06s     C        2       2   theano.sandbox.cuda.basic_ops.GpuDimShuffle
   0.0%   100.0%       0.000s       3.26e-06s     C        3       3   theano.compile.ops.Shape_i
   0.0%   100.0%       0.000s       4.53e-06s     C        2       2   theano.tensor.opt.MakeVector
   0.0%   100.0%       0.000s       5.96e-06s     C        1       1   theano.tensor.elemwise.Prod
   0.0%   100.0%       0.000s       3.10e-06s     C        1       1   theano.tensor.elemwise.DimShuffle
   ... (remaining 0 Classes account for   0.00%(0.00s) of the runtime)

Ops
---
<% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
  39.5%    39.5%       0.019s       6.42e-03s     C        3        3   GpuDot22
  24.8%    64.3%       0.012s       6.04e-03s     C        2        2   GpuCorrMM{valid, (1, 1)}
  11.2%    75.5%       0.005s       1.36e-03s     C        4        4   GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)]
   7.8%    83.3%       0.004s       1.89e-03s     C        2        2   GpuDownsampleFactorMax{(2, 2),True}
   4.2%    87.4%       0.002s       2.03e-03s     C        1        1   GPU_mrg_uniform{CudaNdarrayType(float32, 4D),inplace}
   3.8%    91.2%       0.002s       4.57e-04s     C        4        4   GpuContiguous
   2.9%    94.1%       0.001s       1.43e-03s     C        1        1   GpuElemwise{Composite{Cast{float32}(LT(i0, i1))}}[(0, 0)]
   2.3%    96.5%       0.001s       5.66e-04s     C        2        2   GpuFromHost
   2.3%    98.8%       0.001s       1.12e-03s     C        1        1   GpuElemwise{Composite{Switch(i0, (i1 * i2 * i3), i2)}}[(0, 2)]
   0.5%    99.3%       0.000s       2.51e-04s     C        1        1   GpuSoftmaxWithBias
   0.5%    99.8%       0.000s       2.39e-04s     C        1        1   HostFromGpu
   0.1%    99.8%       0.000s       1.60e-05s     C        2        2   GpuReshape{4}
   0.0%    99.9%       0.000s       9.54e-06s     C        2        2   GpuSubtensor{::, ::, ::int64, ::int64}
   0.0%    99.9%       0.000s       5.01e-06s     C        2        2   GpuDimShuffle{x,0}
   0.0%    99.9%       0.000s       4.53e-06s     C        2        2   MakeVector{dtype='int64'}
   0.0%    99.9%       0.000s       9.06e-06s     C        1        1   GpuReshape{2}
   0.0%    99.9%       0.000s       4.17e-06s     C        2        2   Elemwise{Composite{((i0 + ((i1 + i2) // i3)) // i3)}}[(0, 2)]
   0.0%   100.0%       0.000s       5.96e-06s     C        1        1   Prod{acc_dtype=int64}
   0.0%   100.0%       0.000s       5.96e-06s     C        1        1   Elemwise{Cast{float32}}
   0.0%   100.0%       0.000s       5.01e-06s     C        1        1   Shape_i{0}
   ... (remaining 4 Ops account for   0.02%(0.00s) of the runtime)

Apply
------
<% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
  36.4%    36.4%       0.018s       1.77e-02s      1    33   GpuDot22(GpuReshape{2}.0, dense_5_W)
  15.7%    52.1%       0.008s       7.64e-03s      1    18   GpuCorrMM{valid, (1, 1)}(GpuContiguous.0, GpuContiguous.0)
   9.1%    61.2%       0.004s       4.44e-03s      1    28   GpuCorrMM{valid, (1, 1)}(GpuContiguous.0, GpuContiguous.0)
   5.7%    66.9%       0.003s       2.76e-03s      1    25   GpuDownsampleFactorMax{(2, 2),True}(GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)].0)
   4.2%    71.0%       0.002s       2.03e-03s      1    20   GPU_mrg_uniform{CudaNdarrayType(float32, 4D),inplace}(<CudaNdarrayType(float32, vector)>, MakeVector{dtype='int64'}.0)
   3.6%    74.6%       0.002s       1.74e-03s      1    34   GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)](CudaNdarrayConstant{[[ 0.5]]}, GpuDot22.0, GpuDimShuffle{x,0}.0)
   3.2%    77.8%       0.002s       1.54e-03s      1    22   GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)](CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuCorrMM{valid, (1, 1)}.0, GpuReshape{4}.0)
   2.9%    80.7%       0.001s       1.43e-03s      1    23   GpuElemwise{Composite{Cast{float32}(LT(i0, i1))}}[(0, 0)](GPU_mrg_uniform{CudaNdarrayType(float32, 4D),inplace}.1, CudaNdarrayConstant{[[[[ 0.80000001]]]]})
   2.7%    83.4%       0.001s       1.29e-03s      1    36   GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)](CudaNdarrayConstant{[[ 0.5]]}, GpuDot22.0, GpuDimShuffle{x,0}.0)
   2.3%    85.7%       0.001s       1.12e-03s      1    31   GpuElemwise{Composite{Switch(i0, (i1 * i2 * i3), i2)}}[(0, 2)](GpuFromHost.0, CudaNdarrayConstant{[[[[ 1.25]]]]}, GpuDownsampleFactorMax{(2, 2),True}.0, GpuElemwise{Composite{Cast{float32}(LT(i0, i1))}}[(0, 0)].0)
   2.2%    87.8%       0.001s       1.06e-03s      1    14   GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
   2.2%    90.0%       0.001s       1.06e-03s      1    35   GpuDot22(GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)].0, dense_6_W)
   2.1%    92.1%       0.001s       1.01e-03s      1    30   GpuDownsampleFactorMax{(2, 2),True}(GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)].0)
   2.0%    94.1%       0.001s       9.61e-04s      1     3   GpuFromHost(convolution2d_input_1)
   1.8%    95.9%       0.001s       8.71e-04s      1    29   GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)](CudaNdarrayConstant{[[[[ 0.5]]]]}, GpuCorrMM{valid, (1, 1)}.0, GpuReshape{4}.0)
   1.6%    97.4%       0.001s       7.58e-04s      1    15   GpuContiguous(GpuSubtensor{::, ::, ::int64, ::int64}.0)
   1.0%    98.4%       0.000s       4.72e-04s      1    37   GpuDot22(GpuElemwise{Composite{(i0 * ((i1 + i2) + Abs((i1 + i2))))}}[(0, 1)].0, dense_7_W)
   0.5%    98.9%       0.000s       2.51e-04s      1    38   GpuSoftmaxWithBias(GpuDot22.0, dense_7_b)
   0.5%    99.4%       0.000s       2.39e-04s      1    39   HostFromGpu(GpuSoftmaxWithBias.0)
   0.3%    99.8%       0.000s       1.70e-04s      1    19   GpuFromHost(Elemwise{Cast{float32}}.0)
   ... (remaining 20 Apply instances account for 0.25%(0.00s) of the runtime)
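A profile like the one above can be produced by turning on Theano's profiler before running the prediction; a minimal sketch (the profile flag is a standard Theano option, the model file and input shape are assumptions):

    import os

    # Enable Theano's per-Op/per-Apply profiler; the summary is printed
    # when the process exits.
    os.environ['THEANO_FLAGS'] = 'device=gpu,floatX=float32,profile=True'

    import numpy as np
    from keras.models import load_model

    model = load_model('vgg_pretrained.h5')                 # hypothetical file
    x = np.random.rand(1, 3, 224, 224).astype('float32')
    model.predict(x)                                        # profiled call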
asked by Winston Chen
  • Do you have all the cores enabled? https://devtalk.nvidia.com/default/topic/762257/embedded-systems/cpu-cores-offline-had-to-manually-enable-/ – Selly Apr 17 '17 at 23:04
  • Additionally--you may want to see if the GPU has a fixed clock rate? http://elinux.org/Jetson/Performance#Controlling_GPU_performance – Selly Apr 17 '17 at 23:07
  • @Selly haven't explored that part yet. Will do. – Winston Chen Apr 18 '17 at 01:15
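Regarding the first comment, a quick way to check which cores are online (a sketch using standard Linux sysfs paths; bringing cores online requires root):

    # Which CPU cores does the kernel currently have online?
    with open('/sys/devices/system/cpu/online') as f:
        online = f.read().strip()
    with open('/sys/devices/system/cpu/present') as f:
        present = f.read().strip()

    print('online cores: %s (present: %s)' % (online, present))
    # On a TK1 "present" is typically 0-3; if "online" shows fewer cores,
    # they can be brought up as root, e.g.:
    #   echo 1 | sudo tee /sys/devices/system/cpu/cpu1/online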

1 Answer


It turns out that when Keras + Theano does a prediction, the first call is by far the slowest. After loading the model with keras.models.load_model, the model does not appear to be fully set up on the device yet, and the first prediction call handles the rest of the setup.

After the first prediction, the remaining predictions go very fast. The first prediction with the VGG model takes about 3 seconds, but subsequent predictions take 0.15 to 0.2 seconds. All good now.
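In practice this means doing one throwaway prediction right after load_model and timing only the calls that follow; a minimal sketch (model file and input shape are assumptions):

    import time
    import numpy as np
    from keras.models import load_model

    model = load_model('vgg_pretrained.h5')                 # hypothetical file
    x = np.random.rand(1, 3, 224, 224).astype('float32')

    # First call finishes setting up the graph on the GPU and is slow (~3 s here).
    model.predict(x)

    # Subsequent calls reflect the real latency (~0.15-0.2 s here).
    start = time.time()
    model.predict(x)
    print('warm prediction time: %.3f s' % (time.time() - start))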

answered by Winston Chen