
The TensorFlow website claims that quantization provides up to 3x lower latency on mobile devices: https://www.tensorflow.org/lite/performance/post_training_quantization

I tried to verify this claim and found that quantized models are 45%-75% SLOWER than float models, despite being almost 4 times smaller in size. Needless to say, this is very disappointing and conflicts with Google's claims.

My test uses Google's official MnasNet model: https://storage.googleapis.com/mnasnet/checkpoints/mnasnet-a1.tar.gz
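
For context, post-training quantization is applied through the TFLite converter. Here is a minimal sketch of producing both variants from a SavedModel export (paths are hypothetical, and the exact conversion Google used for the published MnasNet files may differ):

```python
import tensorflow as tf

# Hypothetical path to a SavedModel export of the network.
converter = tf.lite.TFLiteConverter.from_saved_model("mnasnet_saved_model")

# Float baseline.
open("mnasnet_float.tflite", "wb").write(converter.convert())

# Post-training quantization: weights are quantized to 8 bits by default.
converter.optimizations = [tf.lite.Optimize.DEFAULT]
open("mnasnet_quant.tflite", "wb").write(converter.convert())
```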

Here is the average latency based on 100 inference operations on a freshly rebooted phone:

  • Pixel 2: float=81ms, quant=118ms
  • Moto E: float=337ms, quant=590ms
  • LG Treasure: float=547ms, quant=917ms
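
(For reference, those quant/float ratios are 118/81 ≈ 1.46, 590/337 ≈ 1.75, and 917/547 ≈ 1.68, i.e. roughly 46%, 75%, and 68% slower, which is where the 45%-75% range above comes from.)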

My test app measures the timing of only one method (tfLite.runForMultipleInputsOutputs). The results are very consistent (within 1% across multiple executions).
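
My measurement is on Android, but for anyone who wants a quick off-device comparison, here is a rough Python equivalent using tf.lite.Interpreter (this is a sketch, not my app; the model filenames are hypothetical):

```python
import time
import numpy as np
import tensorflow as tf

def average_latency_ms(model_path, runs=100):
    interpreter = tf.lite.Interpreter(model_path=model_path)
    interpreter.allocate_tensors()
    inp = interpreter.get_input_details()[0]
    # A dummy input is fine for latency measurement.
    interpreter.set_tensor(inp["index"], np.zeros(inp["shape"], dtype=inp["dtype"]))
    interpreter.invoke()  # warm-up run, excluded from timing
    start = time.perf_counter()
    for _ in range(runs):
        interpreter.invoke()
    return (time.perf_counter() - start) / runs * 1000

print("float:", average_latency_ms("mnasnet_float.tflite"))
print("quant:", average_latency_ms("mnasnet_quant.tflite"))
```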

I am hoping to see some comments from the TensorFlow team or anybody who can share their own metrics. The numbers above are based on an image classifier model. I also tested an SSD MobileNetV2 object detector with similar results (the quantized model was substantially slower).

Dennis Kashkin
  • Same here, any insights since the post? – koltun Jan 07 '20 at 19:47
  • I posted this question in the hope of hearing from some proud Googler, but it's been 9 months of silence :( OK GOOGLE! – Dennis Kashkin Feb 21 '20 at 20:45
  • Actually, Google recently released a benchmark tool that can be used to profile tflite models. In my case I found that the conv2d_transpose operation took 90% of the model's execution time, so I just replaced it with a ResizeBilinear op and a convolution, which made the model 10x faster :) (see the sketch after these comments) https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/tools/benchmark – koltun Feb 26 '20 at 11:51
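
To make the substitution in the last comment concrete, here is a minimal Keras sketch (the filter count and kernel size are hypothetical, and whether UpSampling2D lowers to a single ResizeBilinear op can depend on the converter version):

```python
import tensorflow as tf
from tensorflow.keras import layers

def upsample(x, filters):
    # Instead of the slow transposed convolution:
    #   x = layers.Conv2DTranspose(filters, 3, strides=2, padding="same")(x)
    # use a bilinear resize (ResizeBilinear in TFLite) plus a plain convolution:
    x = layers.UpSampling2D(size=2, interpolation="bilinear")(x)
    return layers.Conv2D(filters, 3, padding="same")(x)
```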

0 Answers