
The paper "Natural Language Processing with Small Feed-Forward Networks" https://arxiv.org/pdf/1708.00214.pdf states:

[Image: the paper's quantization equations, s_i = (1/(b-1)) * max_j |e_ij| and q_ij = round(e_ij / s_i + b)]

I've implemented quantization as per the above equations in Python:

import math

b = 128

embedding_matrix = [[20000, 3000, 1000], [1999999, 20000, 1999999], [20000, 3000, 1000]]

# per-row scale factors: s_i = (1/(b-1)) * max_j(e_ij)
scaled = [abs(round((1 / (b - 1)) * max(e), 3)) for e in embedding_matrix]

print(scaled)

# q_ij = round(e_ij / s_i + b), stored alongside the original value
quantized = []
for i, e in enumerate(embedding_matrix):
    for v in e:
        quantized.append((v, math.floor(.5 + (v / scaled[i]) + b)))

quantized

Running this code, quantized is set to:

[(20000, 255),
 (3000, 147),
 (1000, 134),
 (1999999, 255),
 (20000, 129),
 (1999999, 255),
 (20000, 255),
 (3000, 147),
 (1000, 134)]

How do I de-quantize back to the original values prior to quantization?

The TensorFlow docs at https://www.tensorflow.org/api_docs/python/tf/quantization/dequantize describe:

tf.quantization.dequantize(
    input, min_range, max_range, mode='MIN_COMBINED', name=None, axis=None,
    narrow_range=False, dtype=tf.dtypes.float32
)

[min_range, max_range] are scalar floats that specify the range for the output. The 'mode' attribute controls exactly which calculations are used to convert the float values to their quantized equivalents.

and the PyTorch quantization docs: https://pytorch.org/docs/stable/quantization.html

Both seem to implement quantization differently from the implementation above?

blue-sky

1 Answer


What they are doing in the paper is roughly this:

import numpy as np

b = 128

embedding_matrix = np.array([[20000, 3000, 1000, 1000],
                             [1999999, 20000, 1999999, 1999999],
                             [20000, 3000, 1000, 1000]])

# per-row scale factors: s_i = max_j |e_ij| / (b - 1)
scales = (np.abs(embedding_matrix).max(axis=1) / (b - 1)).reshape(-1, 1)

# q_ij = round(e_ij / s_i + b); adding 0.5 and truncating to uint8 does the rounding
quantized = (embedding_matrix / scales + b + 0.5).astype(np.uint8)

# de-quantize: shift back by b and multiply by the per-row scale
dequantized = (quantized - b) * scales

print(quantized)
print(dequantized)

Output:

[[255 147 134 134]
 [255 129 255 255]
 [255 147 134 134]]
[[2.00000000e+04 2.99212598e+03 9.44881890e+02 9.44881890e+02]
 [1.99999900e+06 1.57480236e+04 1.99999900e+06 1.99999900e+06]
 [2.00000000e+04 2.99212598e+03 9.44881890e+02 9.44881890e+02]]

In short, they just have q_ij = round(e_ij / s_i + b), so once all you have is the quantized value q_ij, your best approximation is to say q_ij ≈ dequantized_ij / s_i + b, which gives dequantized_ij = (q_ij - b) * s_i.
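If you want to de-quantize the (original, quantized) pairs produced by the question's own snippet, a minimal sketch (reusing b, scaled and quantized from that snippet, and assuming three values per row as in that example) would be:

# de-quantize the question's list of (original, quantized) pairs:
# recover the row index from the flat position, then invert q = round(v / s_i + b)
cols = 3  # values per row in the question's example
dequantized = [(q - b) * scaled[i // cols] for i, (_, q) in enumerate(quantized)]
print(dequantized)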

As for PyTorch, similar functionality is available with torch.quantize_per_channel, e.g. the following code does pretty much the same:

import torch

t = torch.tensor(embedding_matrix, dtype=torch.float32)
# one zero point per row (channel), all equal to b
zero_point = torch.tensor([b]).repeat(t.shape[0], 1).reshape(-1)
# per-channel scales: max |e_ij| per row divided by (b - 1), quantizing along axis 0
quantized_tensor = torch.quantize_per_channel(t, t.abs().max(axis=1)[0] / (b - 1), zero_point, 0, torch.quint8)
print(quantized_tensor)
print(quantized_tensor.int_repr())

Output:

tensor([[2.0000e+04, 2.9921e+03, 9.4488e+02, 9.4488e+02],
        [2.0000e+06, 1.5748e+04, 2.0000e+06, 2.0000e+06],
        [2.0000e+04, 2.9921e+03, 9.4488e+02, 9.4488e+02]], size=(3, 4),
       dtype=torch.quint8, quantization_scheme=torch.per_channel_affine,
       scale=tensor([  157.4803, 15748.0234,   157.4803], dtype=torch.float64),
       zero_point=tensor([128, 128, 128]), axis=0)
tensor([[255, 147, 134, 134],
        [255, 129, 255, 255],
        [255, 147, 134, 134]], dtype=torch.uint8)

If you quantize per channel like this in PyTorch, you can only apply .dequantize() on the full tensor rather than on a slice, which wouldn't be a good thing for embeddings, but you can do it manually very easily using int_repr, q_per_channel_zero_points, and q_per_channel_scales.
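For example, a minimal sketch of de-quantizing a single row (channel) by hand, using the quantized_tensor from the snippet above (the row index 1 is just illustrative):

# de-quantize one row of the per-channel quantized tensor manually
row = 1  # illustrative row index
int_row = quantized_tensor.int_repr()[row].float()               # raw uint8 codes as floats
scale = quantized_tensor.q_per_channel_scales()[row]             # per-row scale s_i
zero_point = quantized_tensor.q_per_channel_zero_points()[row]   # per-row zero point (b)
dequantized_row = (int_row - zero_point) * scale
print(dequantized_row)

This should be close to the second row of the dequantized matrix printed earlier.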

Does this answer your question?

Alexander Pivovarov

  • So it's really just de-scaling, not de-quantizing? – user2357112 Jun 21 '20 at 00:40
  • Well, this is just simple linear quantization. To store values as uint8 they are pretty much scaling them to fit into the 0-255 range and rounding to the nearest integer. – Alexander Pivovarov Jun 21 '20 at 00:44
  • Since we already lost precision by truncating part of the number, the only thing we can do to restore the value it represents is to scale it back. So it is de-quantizing, which only has to apply de-scaling (+ translation, since zero is represented by `b`). – Alexander Pivovarov Jun 21 '20 at 00:45
  • @AlexanderPivovarov what ensures the values are scaled to fit into the 0-255 range? Is it related to the bias b value (128)? – blue-sky Jun 21 '20 at 17:52
  • Yes, also the values will never be 0 (that seems to be consistent with the PyTorch implementation as well, and PyTorch is reserving the 0 quantized value for things like `nan`, `inf`). Due to the way the scales are defined, the value `e_ij / s_i` (in the paper's terms) is guaranteed to be between `-(b-1)` and `b-1`; after adding `0.5 + b` it will be between `1.5` and `2b - 0.5`, so after truncating to an integer it will always be between `1` and `2b-1`, in this case between `1` and `255`. The zero value will always be quantized as `b`, i.e. `128` here (see the quick check after these comments). – Alexander Pivovarov Jun 21 '20 at 21:22
  • @AlexanderPivovarov why is "the value `e_ij / s_i` (in the paper's terms) guaranteed to be between `-(b-1)` and `b-1`"? Is it because of how `s_i`, the scale factor, is calculated? – blue-sky Jun 23 '20 at 08:43
  • @blue-sky yes, `s_i` is defined as `1/(b-1) * max_j (abs(e_ij))`. Start from `abs(e_ij) <= max_j (abs(e_ij))`, then divide both sides by `s_i` and you get `abs(e_ij / s_i) <= max_j (abs(e_ij)) / s_i`. But we know that `max_j (abs(e_ij)) / s_i` is equal to `b-1`, so we get `abs(e_ij / s_i) <= (b-1)` and thus `-(b-1) <= e_ij / s_i <= b-1`. It is important to note that this type of quantization is not updated dynamically - it is just applied after the weights were trained, to store the trained model in a more compact way. – Alexander Pivovarov Jun 23 '20 at 14:10
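A quick numeric check of the range argument from the comments above (the row values are purely illustrative, chosen to include zero and both extremes):

import math

b = 128
row = [-1999999, -1000, 0, 1000, 1999999]      # illustrative values including 0 and the extremes
s = max(abs(v) for v in row) / (b - 1)         # s_i = max_j |e_ij| / (b - 1)
q = [math.floor(.5 + v / s + b) for v in row]  # paper-style quantization
print(q)                                       # [1, 128, 128, 128, 255]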