
I converted the Keras model to TFLite. This is the conversion script:

from keras import backend as K
from keras.models import load_model
from keras.engine.base_layer import Layer
import tensorflow as tf
# This line must be executed before loading Keras model.
K.set_learning_phase(0)

# custom layer
class Mish(Layer):
    '''
    Mish Activation Function.
    .. math::
        mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^{x}))
    Shape:
        - Input: Arbitrary. Use the keyword argument `input_shape`
        (tuple of integers, does not include the samples axis)
        when using this layer as the first layer in a model.
        - Output: Same shape as the input.
    Examples:
        >>> X_input = Input(input_shape)
        >>> X = Mish()(X_input)
    '''

    def __init__(self, **kwargs):
        super(Mish, self).__init__(**kwargs)
        self.supports_masking = True

    def call(self, inputs):
        # return inputs * K.tanh(K.softplus(inputs))
        # return inputs * tf.tanh(tf.log(1 + tf.exp(inputs)))
        return inputs * K.tanh(K.log(1 + K.exp(inputs)))

    def get_config(self):
        config = super(Mish, self).get_config()
        return config

    def compute_output_shape(self, input_shape):
        return input_shape

model = load_model('./keras_model/yolo4.h5', custom_objects={"Mish":Mish})


def freeze_session(session, keep_var_names=None, output_names=None, clear_devices=True):
    """Freeze the TF1 session: convert all variables in the graph to constants so it can be saved as a single .pb file."""
    from tensorflow.python.framework.graph_util import convert_variables_to_constants
    graph = session.graph
    with graph.as_default():
        freeze_var_names = list(set(v.op.name for v in tf.global_variables()).difference(keep_var_names or []))
        output_names = output_names or []
        output_names += [v.op.name for v in tf.global_variables()]
        # Graph -> GraphDef ProtoBuf
        input_graph_def = graph.as_graph_def()
        if clear_devices:
            for node in input_graph_def.node:
                node.device = ""
        frozen_graph = convert_variables_to_constants(session, input_graph_def,
                                                        output_names, freeze_var_names)
        return frozen_graph

frozen_graph = freeze_session(K.get_session(),
                              output_names=[out.op.name for out in model.outputs])


tf.train.write_graph(frozen_graph, "frozen", "tf_model_l0.pb", as_text=False)

converter = tf.lite.TFLiteConverter.from_frozen_graph('frozen/tf_model_l0.pb', 
            input_arrays=['input_1'], 
            output_arrays=["conv2d_110/BiasAdd","conv2d_102/BiasAdd","conv2d_94/BiasAdd"]  
        )

tfmodel = converter.convert() 
open ("model5.tflite" , "wb").write(tfmodel)

Above is the conversion script. At inference time I apply the same preprocessing that is used for the Keras inference. This is the TFLite inference code:

# load tflite model
babynet = tf.lite.Interpreter(model_path=model_path)
# allocate tensors
babynet.allocate_tensors()

input_details = babynet.get_input_details()
output_details = babynet.get_output_details()

# image reading and preprocessing (same as for the Keras model)
img = cv2.imread("test.jpg")
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
img = cv2.resize(img, (416, 416))
img = img.astype(np.float32) / 255.
img = np.expand_dims(img, axis=0)

babynet.set_tensor(input_details[0]['index'], img)
# run the inference
babynet.invoke()
# output data
outs = []
outs.append(babynet.get_tensor(output_details[0]['index']))
outs.append(babynet.get_tensor(output_details[1]['index']))
outs.append(babynet.get_tensor(output_details[2]['index']))

I am getting accurate results with TFLite, but it takes a very long time to process one frame. The Keras model's inference time is 1.011 seconds per frame, but the TFLite inference takes 7.56 seconds per frame.
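
For reference, this is roughly how I measure the per-frame time (a minimal sketch, not my exact benchmarking code):

import time

start = time.perf_counter()
babynet.invoke()
print("TFLite inference: %.3f seconds per frame" % (time.perf_counter() - start))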

After that, I quantized the model to float16 with this code:

# float16
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.target_spec.supported_types = [tf.float16]
tflite_fp16_model = converter.convert()
tflite_model_fp16_file = "model_quant_f16.tflite"
open(tflite_model_fp16_file, "wb").write(tflite_fp16_model)

Then I checked the inference time again. Now it is around 2.1 seconds per frame. The model size is reduced from 256 MB to 128 MB and the accuracy is the same, but the inference time is still longer than the Keras model's. Where did I make a mistake?

I don't understand where the mistake is. My Keras model's inference takes 1 second per frame, but the same model converted to TFLite takes 2 seconds per frame. I am using a CPU-only system. The TensorFlow version is 1.15.2 and the Keras version is 2.3.1. I am not gaining any inference speed after converting to TFLite.

  • I would assume that the Keras model is using the GPU and the TFLite is running on CPU – BCJuan Mar 03 '21 at 14:39
  • No, the model is running on CPU. I don't have a GPU system. – Pranali Kulkarni Mar 03 '21 at 18:03
  • I want to reproduce the issue. Where can I get the keras_model/yolo4.h5 file? – Terry Heo Mar 04 '21 at 07:49
  • Yes, this is the yolo4.h5 file: https://drive.google.com/file/d/1mYduVQStQfQ7Y0RSrNXGWuu7_rVw4ZRu/view?usp=sharing – Pranali Kulkarni Mar 04 '21 at 08:13
  • This is the actual Keras yolo4 model: https://drive.google.com/file/d/1t_RO27KaPkSIzllEzBmlUGV8Ns2fUTNL/view?usp=sharing – Pranali Kulkarni Mar 04 '21 at 08:15
  • This is the Keras-to-TFLite conversion script: https://drive.google.com/file/d/1CYk9Y3HI5wVutB1FWj2gUYEWUDBWCfTz/view?usp=sharing – Pranali Kulkarni Mar 04 '21 at 08:16
  • It seems that you're using TF1. Could you try to use TF2 to run TFLite model? (Only for inference) In TF2, you can specify the number of threads. https://www.tensorflow.org/api_docs/python/tf/lite/Interpreter – Terry Heo Mar 05 '21 at 04:20
  • Yeah, I tried it in TF2 as well, with the number of threads option. I set num_threads to 4 (see the sketch after these comments), but it gives the same performance. – Pranali Kulkarni Mar 05 '21 at 09:15
  • I don't understand where I made a mistake. My Keras model inference takes 1 second per frame, but the same converted TFLite model takes 2 seconds per frame. – Pranali Kulkarni Mar 05 '21 at 18:41
  • Could you file a ticket on https://github.com/tensorflow/tensorflow/issues ? – Terry Heo Mar 06 '21 at 08:06
  • I have found this answer: https://stackoverflow.com/questions/54093424/why-is-tensorflow-lite-slower-than-tensorflow-on-desktop . It probably is related to the CPU instructions Tensorflow was built for – BCJuan Mar 07 '21 at 10:01
  • So what do I have to do to increase the inference speed, at least to make the TFLite inference as fast as the Keras inference? Right now the Keras inference is twice as fast as the TFLite inference. – Pranali Kulkarni Mar 07 '21 at 15:07
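
For reference, this is the TF2 num_threads experiment mentioned in the comments above (a minimal sketch, assuming the TF2 tf.lite.Interpreter API; the rest of the inference flow stays the same as in the question):

import tensorflow as tf  # TF 2.x

# load the float16 model with 4 CPU threads
interpreter = tf.lite.Interpreter(model_path="model_quant_f16.tflite", num_threads=4)
interpreter.allocate_tensors()
# ...same set_tensor / invoke / get_tensor flow as above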

1 Answer


I know this is not a direct answer to your question, but if you're looking for a faster way to run inference, I'd recommend trying OpenVINO. OpenVINO is optimized for Intel hardware, but it should work with any CPU. It optimizes inference performance by, for example, graph pruning and fusing some operations together. Here are the performance benchmarks for Keras/TensorFlow models.

You can find a full tutorial on how to convert the Keras model here. Some snippets below.

Install OpenVINO

The easiest way to do it is using pip. Alternatively, you can use this tool to find the best way in your case.

pip install openvino-dev[tensorflow2]

Save your model as SavedModel

OpenVINO is not able to convert the HDF5 model, so you have to save it as SavedModel first.

import tensorflow as tf
from custom_layer import CustomLayer
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')
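
If your model uses a custom layer, like the Mish activation from the question, the same custom_objects argument applies. A minimal sketch, assuming the Mish class from your conversion script is importable (the module name and output directory below are just placeholders):

import tensorflow as tf
from yolo4_layers import Mish  # hypothetical module containing the Mish layer from the question

model = tf.keras.models.load_model('./keras_model/yolo4.h5', custom_objects={'Mish': Mish})
tf.saved_model.save(model, 'saved_yolo4')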

Use Model Optimizer to convert SavedModel model

The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the TensorFlow model to IR, the default format for OpenVINO. You can also try FP16 precision, which should give you better performance without a significant accuracy drop (just change the data_type, as shown below). Run in the command line:

mo --saved_model_dir "model" --input_shape "[1, 3, 224, 224]" --data_type FP32 --output_dir "model_ir"
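
For example, the FP16 variant would look like this (the output directory name is just an example; for the question's 416x416 model the input_shape would presumably be [1, 416, 416, 3] rather than the generic shape above):

mo --saved_model_dir "model" --input_shape "[1, 416, 416, 3]" --data_type FP16 --output_dir "model_ir_fp16"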

Run the inference

The converted model can be loaded by the runtime and compiled for a specific device, e.g. CPU or GPU (the GPU integrated into your CPU, like Intel HD Graphics). If you don't know what the best choice for you is, just use AUTO.

from openvino.runtime import Core

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")

# Get the output layer
output_layer_ir = compiled_model_ir.output(0)

# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
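
The input_image above is your preprocessed frame. For the 416x416 YOLOv4 model from the question, the preprocessing could look roughly like this (a sketch assuming the IR keeps the NHWC layout of the original TensorFlow model; check the compiled model's input shape to confirm):

import cv2
import numpy as np

# same preprocessing as in the question's TFLite inference
img = cv2.imread("test.jpg")
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
img = cv2.resize(img, (416, 416))
img = img.astype(np.float32) / 255.0
input_image = np.expand_dims(img, axis=0)  # shape (1, 416, 416, 3)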

Disclaimer: I work on OpenVINO.

dragon7