Run quantized tensorflow model on FPGA / pure python

Question

I have a simple Keras model trained on the MNIST dataset.

I want to rewrite this model so that it runs on an FPGA device. To do that, I need to fully understand how a quantized model works.

First, I converted the model to .tflite format with UINT8 precision using post-training quantization (https://www.tensorflow.org/lite/performance/post_training_quantization).

The quantized model reaches about 90% accuracy.

Now I am trying to extract the weights from the quantized model and implement inference in pure Python. I use Netron to visualize the model and read its weights: https://github.com/lutzroeder/netron.

A simple Python implementation (matrix multiplication, bias addition and ReLU) works with the original weights, but the same code with the quantized weights does not.

So my question is: how do I write the feed-forward pass for the quantized model using NumPy?

My model in keras looks like this:

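import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Activation
from tensorflow.keras.optimizers import Adam

model = Sequential()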
model.add(Dense(512, input_shape=input_shape))
model.add(Activation(tf.nn.relu))
model.add(Dense(100))
model.add(Activation(tf.nn.relu))
model.add(Dense(num_classes))
model.add(Activation(tf.nn.softmax))
model.compile(
    optimizer=Adam(),
    loss='categorical_crossentropy',
    metrics=['accuracy'],
)

I converted it with TocoConverter, and the converted model works in TensorFlow.

Then I tried to write the feed-forward pass in pure Python:

for img, label in zip(x_test, y_test):
    img = img.astype('uint8')
    total_seen += 1
    label = tf.keras.utils.to_categorical(label, num_classes=num_classes)
    X = img.reshape(1, 784)
    z1 = np.dot(X, W0.T) + b0
    a1 = relu(z1)
    z2 = np.dot(a1, W1.T) + b1
    a2 = relu(z2)
    z3 = np.dot(a2, W2.T) + b2
    prediction = np.argmax(z3)
    label = np.argmax(label)
    if prediction == label:
        num_correct += 1

But this model's accuracy is only about 10%, so something is going wrong. How can I correct it?

Thanks in advance.

Edit: I've read the paper on quantization in TensorFlow: http://openaccess.thecvf.com/content_cvpr_2018/papers/Jacob_Quantization_and_Training_CVPR_2018_paper.pdf

I understand almost everything in it, including the S and Z values for activations and kernels. But after the matrix multiplication, the result should be multiplied by the factor M := S1*S2/S3, and I don't know what the S3 scale is or how to get it, because I can't see anything related to it in the Netron graph. Any suggestions?
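For reference, the requantization step from the paper can be sketched in NumPy as below. In the paper's notation, S3 and Z3 are the scale and zero point of the layer's output (activation) tensor, so in a TFLite model they should be the quantization parameters attached to the tensor the layer writes to, not to the kernel. This is only a sketch under assumptions: the function name `quantized_dense` and all argument names are made up for illustration, the bias is assumed to be int32 with scale S1*S2 and zero point 0 as the paper describes, and M is kept as a float here, whereas the paper represents it as a fixed-point multiplier plus a shift, which is what an FPGA implementation would likely need.

```python
import numpy as np

def quantized_dense(q_x, z_x, s_x, q_w, z_w, s_w, q_bias, s_out, z_out, relu=True):
    """Integer dense layer: q_out = Z3 + M * (sum((q_x - Z_x) * (q_w - Z_w)) + q_bias).

    Hypothetical helper for illustration; it is not part of TensorFlow Lite.
    """
    # Accumulate in int32 after subtracting the zero points.
    acc = (q_x.astype(np.int32) - z_x) @ (q_w.astype(np.int32) - z_w).T
    # Bias is assumed to be int32 with scale s_x * s_w and zero point 0.
    acc = acc + q_bias
    # M := S1 * S2 / S3, where S3 is the scale of the output tensor.
    m = s_x * s_w / s_out
    q_out = np.round(m * acc).astype(np.int32) + z_out
    # Real value 0.0 maps to z_out, so clamping at z_out acts as ReLU.
    low = z_out if relu else 0
    return np.clip(q_out, low, 255).astype(np.uint8)

# Toy values chosen by hand, not taken from a real model.
q_x = np.array([[130, 128]], dtype=np.uint8)
q_w = np.array([[132, 128], [124, 128]], dtype=np.uint8)
q_bias = np.array([8, 0], dtype=np.int32)
q_out = quantized_dense(q_x, 128, 0.5, q_w, 128, 0.25, q_bias, 0.5, 0)
```

With these toy values the accumulator is [16, -8], M is 0.25, and the second unit is clamped to the output zero point by the ReLU.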

Please add the weight code you try. Even better adding some simple examples so that people can see where the problem lies at. — E.Coms, Nov 21 '18 at 22:24
Did you manage to implement the model on FPGA? I am trying to do the same, but cannot figure out proper calculations flow. — Nazar, Dec 05 '19 at 19:21

score 0 · Answer 1 · answered Dec 12 '18 at 12:39

There are two steps you'll need to do:

1. Dequantize the input, weights and bias back into full precision (or integer equivalent):

   (w-w_offset)*w_scale

2. After the ReLU, quantize the activations back into integers:

   a/a_scale+a_offset

You can probably skip step 2, which quantizes and then dequantizes the activations, at a minor risk of getting results that differ from the TFLite model. This is because ReLU has no upper bound, whereas TFLite saturates it to a maximum value.

You can check out my TFLite tutorials on my GitHub, where I have introduced the concept and training and am about to write about inference.
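The two steps can be sketched in NumPy roughly as follows. This is a minimal illustration, not the TFLite implementation: the helper names (`dequantize`, `quantize`, `dense_relu`) are made up, each `*_qp` argument is assumed to be a `(scale, offset)` pair read from the model, and the bias is assumed to be stored as int32 with scale `x_scale * w_scale` and offset 0.

```python
import numpy as np

def dequantize(q, scale, offset):
    # Step 1: integer -> real value, (q - offset) * scale
    return (np.asarray(q, dtype=np.float32) - offset) * scale

def quantize(a, scale, offset, qmin=0, qmax=255):
    # Step 2: real value -> uint8, a / scale + offset, rounded and saturated
    q = np.round(np.asarray(a, dtype=np.float32) / scale + offset)
    return np.clip(q, qmin, qmax).astype(np.uint8)

def dense_relu(x_q, x_qp, w_q, w_qp, b_q, out_qp):
    """One Dense + ReLU layer computed in float, with quantized input and output."""
    x = dequantize(x_q, *x_qp)
    w = dequantize(w_q, *w_qp)
    # Assumption: bias scale is the product of input and weight scales, offset 0.
    b = dequantize(b_q, x_qp[0] * w_qp[0], 0)
    a = np.maximum(x @ w.T + b, 0.0)   # ReLU
    return quantize(a, *out_qp)        # saturates at qmax, like a bounded activation

# Toy values chosen by hand, not taken from a real model.
x_q = np.array([[130, 128]], dtype=np.uint8)
w_q = np.array([[132, 128], [124, 128]], dtype=np.uint8)
b_q = np.array([8, 0], dtype=np.int32)
out = dense_relu(x_q, (0.5, 128), w_q, (0.25, 128), b_q, (0.5, 0))
```

Chaining layers means passing `out` and the same `out_qp` as the next layer's `x_q` and `x_qp`. Skipping step 2 amounts to returning `a` directly and feeding it to the next layer as a float.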
