Why is quantized graph inference takes much more time than using the original graph?

Question

I followed this tutorial in order to quantize my graph into 8 bit.I can't share the exact graph here but i can say it's a simple convolutional neural network.

When i run the benchmark tool over the original and quantized networks it's clear that the quantized network is much much slower (100 ms vs. 4.5 ms).

Slowest nodes in original network :

time average [ms]   [%] [cdf%]  [Op]    [Name]
1.198   26.54%  26.54%  MatMul  fc10/fc10/MatMul
0.337   7.47%   34.02%  Conv2D  conv2/Conv2D
0.332   7.36%   41.37%  Conv2D  conv4/Conv2D
0.323   7.15%   48.53%  Conv2D  conv3/Conv2D
0.322   7.14%   55.66%  Conv2D  conv5/Conv2D
0.310   6.86%   62.53%  Conv2D  conv1/Conv2D
0.118   2.61%   65.13%  Conv2D  conv2_1/Conv2D
0.105   2.32%   67.45%  MaxPool pool1

Slowest nodes in quantized network :

time average [ms]   [%] [cdf%]  [Op]    [Name]
8.289   47.67%  47.67%  QuantizedMatMul fc10/fc10/MatMul_eightbit_quantized_bias_add
5.398   5.33%   53.00%  QuantizedConv2D conv5/Conv2D_eightbit_quantized_conv
5.248   5.18%   58.18%  QuantizedConv2D conv4/Conv2D_eightbit_quantized_conv
4.981   4.92%   63.10%  QuantizedConv2D conv2/Conv2D_eightbit_quantized_conv
4.908   4.85%   67.95%  QuantizedConv2D conv3/Conv2D_eightbit_quantized_conv
3.167   3.13%   71.07%  QuantizedConv2D conv5_1/Conv2D_eightbit_quantized_conv
3.049   3.01%   74.08%  QuantizedConv2D conv4_1/Conv2D_eightbit_quantized_conv
2.973   2.94%   77.02%  QuantizedMatMul fc11/MatMul_eightbit_quantized_bias_add

What is the reason for that ? I'm using tensorflow version compiled from source, without gpu support.

Are you running on GPU? If you are, the float graph will be placed on GPU resulting in a speedup, but Quantized ops currently don't have GPU implementations so they will be placed on CPU resulting in a slowdown. Perhaps take a look at your op placement and let us know? — suharshs, Oct 23 '17 at 23:58

score 1 · Answer 1 · answered Jul 17 '17 at 00:29

1

https://github.com/tensorflow/tensorflow/issues/2807

Check the comments here. It seems that quantization isn't yet optimized for x86. My quantized inception resnet v2 runs slower than the original too.

answered Jul 17 '17 at 00:29

Sungsu Lim

13
3

Why is quantized graph inference takes much more time than using the original graph?

1 Answers1