Regarding the naming, it is a bit of a misnomer. The 'args' variant expresses min/max as attributes and is therefore only valid for fixed ranges. The 'vars' variants take arbitrary tensors for min/max; whether those are actual vars or some other computed value depends on your quantization approach. The 'vars' variants have gradients defined for their min/max and can therefore be trained. Many training approaches instead just compute min/max from each batch at training time and accumulate these into non-trainable vars using an exponential moving average. Then at eval time, the min/max vars are used in place of the computed min/max.
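For example, the moving-average accumulation might look roughly like this (a minimal TF2 eager-mode sketch; ema_fake_quant and the 0.99 decay are illustrative choices, not a standard API):

```python
import tensorflow as tf

# Non-trainable range vars that accumulate batch statistics across steps.
min_var = tf.Variable(0.0, trainable=False)
max_var = tf.Variable(0.0, trainable=False)

def ema_fake_quant(x, min_var, max_var, is_training, decay=0.99):
    if is_training:
        # Fold the batch min/max into the range vars with an exponential
        # moving average.
        min_var.assign(decay * min_var + (1.0 - decay) * tf.reduce_min(x))
        max_var.assign(decay * max_var + (1.0 - decay) * tf.reduce_max(x))
    # At eval time (is_training=False) the accumulated vars are used as-is.
    return tf.quantization.fake_quant_with_min_max_vars(x, min_var, max_var)

x = tf.random.uniform([4, 16], minval=-1.0, maxval=1.0)
y = ema_fake_quant(x, min_var, max_var, is_training=True)
```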
If adding them manually, you need to make sure that every arithmetic op (add, mul, etc., but not transpose, reshape, etc.) has an appropriate fake_quant* op on each tensor that feeds into it.
In practice, the rules I've found that work for this are:
1. When a weight var feeds into an arithmetic op, add a fake_quant_with_min_max_vars that computes its min/max from the min/max of the weight (see the sketch after this list).
2. Add a fake_quant_with_min_max_vars after any arithmetic op; at training time it accumulates into dedicated min/max vars for each op, and at eval time it just uses those vars (as in the moving-average sketch above).
3. Add an appropriate fake_quant* op to the very top-level inputs of your model (not necessary if the model is driven via some form of embedding lookup). This includes incoming constants, unless they are already in the default range.
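Put together, a minimal sketch of rules 1 and 3 might look like this (again assuming TF2 eager mode; the helper names and the [-1, 1] input range are illustrative, and rule 2 would then apply to y as in the moving-average sketch above):

```python
import tensorflow as tf

def quantize_weight(w):
    # Rule 1: a weight's range comes straight from its own min/max, so no
    # dedicated range variables are needed.
    return tf.quantization.fake_quant_with_min_max_vars(
        w, tf.reduce_min(w), tf.reduce_max(w))

def quantize_model_input(x, input_min=-1.0, input_max=1.0):
    # Rule 3: top-level inputs have a fixed, known range, so the 'args'
    # variant (min/max expressed as attributes) is appropriate here.
    return tf.quantization.fake_quant_with_min_max_args(
        x, min=input_min, max=input_max)

x = tf.random.uniform([4, 16], minval=-1.0, maxval=1.0)
w = tf.Variable(tf.random.normal([16, 8]))
y = tf.matmul(quantize_model_input(x), quantize_weight(w))
# Rule 2 would now fake-quantize y using dedicated min/max vars.
```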
If you do it this way, you'll generally end up with every tensor quantized and without redundant or conflicting quant params. Depending on the model, additional nuance and other tricks may be needed to actually get toco/tflite to run it with only quantized types.
I'm less familiar with the automated tools that do this, but I believe this is the general approach they take when rewriting the graph. They also carry significant complexity to detect and work around certain patterns that need extra massaging when transforming blind at the graphdef level (as opposed to the source level, where some things are more obvious).
For the "manual" approach to not be too burdensome, I've written/used libraries that just let me annotate the important tensors by passing them through helper functions that defer to a model level set of parameters that let me tune the quantization strategy layer by layer.
Hth.