
Is it possible to enable XLA compilation when doing inference with Tensorflow Serving?

(I am hoping it's just a matter of undocumented configs and that I can avoid implementing a custom Servable).

njs
  • There are mentions of XLA in the TensorFlow Serving sources. Serving still depends on TensorFlow, so if you compile it from source it fetches the TensorFlow sources, compiles those first, and then compiles Serving. I would try building from source and serving an XLA-optimized model with it. If that fails, you might need to play with Bazel so that you are in charge of the build options. – clstl Feb 13 '19 at 21:13
  • I saw XLA mentioned in the warm-up protobuf, which makes sense since you'd want the JIT to be done before serving production traffic. XLA ahead-of-time compilation is only for mobile targets, as I understand it. For normal GPU XLA acceleration you need to turn it on via a TF session ConfigProto (graph_options.optimizer_options.global_jit_level; a minimal sketch of that follows these comments), but in the case of TensorFlow Serving I'm only handing in a frozen graph def. I don't have access to the session inside the box. – njs Feb 14 '19 at 07:47
  • have you figured it out? – Andy Cheung Feb 25 '19 at 06:44
  • Nope. I ended up switching to Nvidia's TensorRT Inference Server instead. – njs Feb 25 '19 at 08:02
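
For reference, the session-level switch mentioned in the comments looks roughly like this in the TF 1.x Python API. This is a minimal sketch of the normal (non-Serving) path; TensorFlow Serving builds the session for you and does not expose this config directly:

    import tensorflow as tf

    # TF 1.x: turn on the XLA JIT for a session you create yourself.
    # TensorFlow Serving owns its own session, so this knob is not
    # reachable from the outside.
    config = tf.ConfigProto()
    config.graph_options.optimizer_options.global_jit_level = (
        tf.OptimizerOptions.ON_1)

    with tf.Session(config=config) as sess:
        # ... build and run the graph as usual ...
        pass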

1 Answer


@njs,

Doing compilation during inference is actually not recommended. Compilation at inference time can cause HBM to run out of memory, leaving the chips unable to serve requests.

The recommended solution is to:

  1. Use the batching function with a fixed set of allowed batch sizes to restrict the number of compilations at run time (a sample batching config is sketched right after this list).

  2. Do all compilations for these allowed batch sizes at model load time instead of inference time (see the warmup sketch below). This way your model is ready for inference right after load, rather than going through high-latency compilation at inference time.
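
A minimal sketch of point 1, assuming the model server's --enable_batching and --batching_parameters_file flags; the batch sizes below are illustrative and should match what you warm up:

    # batching_parameters.txt, passed to the server with:
    #   tensorflow_model_server --enable_batching=true \
    #       --batching_parameters_file=/path/to/batching_parameters.txt ...
    max_batch_size { value: 32 }
    batch_timeout_micros { value: 5000 }
    max_enqueued_batches { value: 100 }
    num_batch_threads { value: 4 }
    # Only these batch sizes ever reach the model, so only these shapes
    # ever need to be compiled.
    allowed_batch_sizes: 8
    allowed_batch_sizes: 16
    allowed_batch_sizes: 32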

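A minimal sketch of point 2, assuming the SavedModel warmup mechanism (the warm-up protobuf mentioned in the comments): write one PredictionLog per allowed batch size into assets.extra/tf_serving_warmup_requests inside the SavedModel version directory. The server replays these requests at load time, which is when the expensive compilation then happens. The model name, signature name, input name, and shapes below are placeholders:

    import tensorflow as tf
    from tensorflow_serving.apis import model_pb2, predict_pb2, prediction_log_pb2

    # Placeholder path: <model dir>/<version>/assets.extra/tf_serving_warmup_requests
    WARMUP_FILE = "my_model/1/assets.extra/tf_serving_warmup_requests"

    with tf.io.TFRecordWriter(WARMUP_FILE) as writer:
        for batch_size in (8, 16, 32):  # keep in sync with allowed_batch_sizes
            request = predict_pb2.PredictRequest(
                model_spec=model_pb2.ModelSpec(name="my_model",
                                               signature_name="serving_default"),
                inputs={"input": tf.make_tensor_proto(
                    [[0.0] * 128] * batch_size)},  # placeholder input shape
            )
            log = prediction_log_pb2.PredictionLog(
                predict_log=prediction_log_pb2.PredictLog(request=request))
            writer.write(log.SerializeToString())
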
RakTheGeek