
I'm working on a reinforcement learning model implemented with Keras and TensorFlow. I have to make frequent calls to model.predict() on single inputs.

While testing inference on a simple pretrained model, I noticed that Keras' model.predict is WAY slower than just using NumPy on the stored weights. Why is it so slow, and how can I speed it up? Using pure NumPy is not viable for complex models.

import timeit
import numpy as np
from tensorflow.python.keras.models import Sequential
from tensorflow.python.keras.layers import Dense

w = np.array([[-1., 1., 0., 0.], [0., 0., -1., 1.]]).T
b = np.array([ 15., -15., -21., 21.])

model = Sequential()
model.add(Dense(4, input_dim=2, activation='linear'))
model.layers[0].set_weights([w.T, b])
model.compile(loss='mse', optimizer='adam')

state = np.array([-23.5, 17.8])

def predict_very_slow():
    return model.predict(state[np.newaxis])[0]

def predict_slow():
    ws = model.layers[0].get_weights()
    return np.matmul(ws[0].T, state) + ws[1]

def predict_fast():
    return np.matmul(w, state) + b

print(
    timeit.timeit(predict_very_slow, number=10000),
    timeit.timeit(predict_slow, number=10000),
    timeit.timeit(predict_fast, number=10000)
)
# 5.168972805004538 1.6963867129435828 0.021918574168087623
# 5.461319456664639 1.5491559107269515 0.021502970783442876

4 Answers


A little late, but maybe useful for someone:

Replace `model.predict(X)` with `model.predict(X, batch_size=len(X))`

That should do it.
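
Applied to the toy model from the question, it is a one-line change; a minimal sketch (reusing `model` and `state` from the question, not part of the original answer) might look like:

import numpy as np

X = state[np.newaxis]                         # batch containing a single input, shape (1, 2)
# Pass the batch size explicitly, as the answer suggests
out = model.predict(X, batch_size=len(X))[0]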

  • This saves a tonne of time. Only catch is to make sure that `batch_size` is not very large or else tensorflow will throw an `OOM`. – Ruthvik Vaila Dec 14 '19 at 19:24
  • This worked for me when some of the other more complicated solutions like using a compiled/uncompiled model did not. – tim654321 Mar 26 '20 at 10:39

Are you running your Keras model (with TensorFlow backend) in a loop? If so, Keras has a memory leak issue identified here: LINK

In this case you have to import the following:

import keras.backend.tensorflow_backend
import tensorflow as tf

from keras.backend import clear_session

Finally, you have to put the following at the end of every loop iteration, after you're done with your computations:

clear_session()
if keras.backend.tensorflow_backend._SESSION:
    tf.reset_default_graph()
    keras.backend.tensorflow_backend._SESSION.close()
    keras.backend.tensorflow_backend._SESSION = None

This should help you free up memory at the end of every loop and, eventually, make the process faster. I hope this helps.
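
For context, a loop using this cleanup could look roughly like the sketch below (`build_model`, `run_episode`, and `num_episodes` are placeholder names, not part of the original answer):

import keras.backend.tensorflow_backend
import tensorflow as tf
from keras.backend import clear_session

for episode in range(num_episodes):
    model = build_model()        # hypothetical: a fresh model each iteration
    run_episode(model)           # hypothetical: training / predictions for this iteration

    # Cleanup at the end of the iteration, as described above
    clear_session()
    if keras.backend.tensorflow_backend._SESSION:
        tf.reset_default_graph()
        keras.backend.tensorflow_backend._SESSION.close()
        keras.backend.tensorflow_backend._SESSION = None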

  • As I understand it, the memory leak is only a problem if you create models in a loop. I only use a single model object in the test case. – alexander Feb 15 '18 at 00:46
  • @schoeberl Understood. There might be several ways to make a single model faster, and that would depend on how you're setting up the model. A quick search led me to this, which I thought might be helpful: [LINK](https://stackoverflow.com/questions/42184863/how-do-you-make-tensorflow-keras-fast-with-a-tfrecord-dataset) – troymyname00 Feb 16 '18 at 00:58

The memory leak issue still seems to persist in Keras. The following lines of code mentioned in that issue did the trick for me:

from keras import backend as K  # assumed import: K.clear_session() below implies the Keras backend
import gc

model = ....  # build or load your model here
# ... use the model, then clean up:
del model
K.clear_session()
gc.collect()

If you prefer to stay with the network instead of NumPy calculations, you could try OpenVINO. OpenVINO is optimized for Intel hardware, but it should work with any CPU. It optimizes your model by converting it to the Intermediate Representation (IR), performing graph pruning, and fusing some operations into others while preserving accuracy. It then uses vectorization at runtime. The performance gain should be especially visible with larger networks.

It's rather straightforward to convert the Keras model to OpenVINO. The full tutorial on how to do it can be found here. Some snippets below.

Install OpenVINO

The easiest way to do it is with pip. Alternatively, you can use this tool to find the best way in your case.

pip install openvino-dev[tensorflow2]

Save your model as SavedModel

OpenVINO is not able to convert an HDF5 model, so you have to save it as a SavedModel first.

import tensorflow as tf
from custom_layer import CustomLayer  # only needed if your model uses custom layers

# Load the trained Keras (HDF5) model and re-export it in the SavedModel format
model = tf.keras.models.load_model('model.h5', custom_objects={'CustomLayer': CustomLayer})
tf.saved_model.save(model, 'model')

Use Model Optimizer to convert SavedModel model

The Model Optimizer is a command-line tool that comes with the OpenVINO Development Package. It converts the TensorFlow model to IR, the default format for OpenVINO. You can also try FP16 precision, which should give you better performance without a significant accuracy drop (just change the data_type flag). Run in the command line:

mo --saved_model_dir "model" --input_shape "[1, 3, 224, 224]" --data_type FP32 --output_dir "model_ir"
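
For example, the FP16 variant of the same command (same paths as above) only changes the data_type flag:

mo --saved_model_dir "model" --input_shape "[1, 3, 224, 224]" --data_type FP16 --output_dir "model_ir"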

Run the inference

The converted model can be loaded by the runtime and compiled for a specific device, e.g. CPU or GPU (integrated into your CPU, like Intel HD Graphics). If you don't know what the best choice for you is, just use AUTO.

from openvino.runtime import Core  # import missing in the original snippet

# Load the network
ie = Core()
model_ir = ie.read_model(model="model_ir/model.xml")
compiled_model_ir = ie.compile_model(model=model_ir, device_name="CPU")

# Get output layer
output_layer_ir = compiled_model_ir.output(0)

# Run inference on the input image
result = compiled_model_ir([input_image])[output_layer_ir]
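
For the toy model from the question, `input_image` would simply be the state reshaped to a batch, e.g. something like `state[np.newaxis].astype(np.float32)`, assuming the model was converted with `--input_shape "[1, 2]"` rather than the image shape shown above.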

Disclaimer: I work on OpenVINO.
