
Is it possible to enable XLA compilation when doing inference with Tensorflow Serving?

(I am hoping it's just a matter of undocumented configs and that I can avoid implementing a custom Servable).

njs
  • There are mentions of XLA in the TensorFlow Serving sources. Serving still depends on TensorFlow, so if you compile it from source it fetches the TensorFlow sources, compiles those first, and then compiles Serving. I would try building from source and serving an XLA-optimized model with it. If that fails, you might need to play with Bazel so that you are in charge of the build options. – clstl Feb 13 '19 at 21:13
  • I saw XLA mentioned in the warm-up protobuf, which makes sense since you'd want the JIT to be done before serving production traffic. XLA ahead-of-time compilation is only for mobile targets, as I understand it. For normal GPU XLA acceleration you need to turn it on via a TF session ConfigProto (graph_options.optimizer_options.global_jit_level; a minimal sketch of that follows these comments), but in the case of TensorFlow Serving I'm only handing in a frozen graph def. I don't have access to the session inside the box. – njs Feb 14 '19 at 07:47
  • have you figured it out? – Andy Cheung Feb 25 '19 at 06:44
  • Nope. I ended up switching to Nvidia's TensorRT Inference Server instead. – njs Feb 25 '19 at 08:02
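
For reference, the session-level switch mentioned in the comments looks roughly like this in the TF 1.x Python API. This is a minimal sketch of the normal (non-Serving) path; TensorFlow Serving builds the session for you and does not expose this config directly:

    import tensorflow as tf

    # TF 1.x: turn on the XLA JIT for a session you create yourself.
    # TensorFlow Serving owns its own session, so this knob is not
    # reachable from the outside.
    config = tf.ConfigProto()
    config.graph_options.optimizer_options.global_jit_level = (
        tf.OptimizerOptions.ON_1)

    with tf.Session(config=config) as sess:
        # ... build and run the graph as usual ...
        pass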

1 Answer


@njs,

Doing compilation during inference is actually not recommended. Compilation at inference time can cause HBM to run out of memory, leaving the chips unable to serve requests.

The recommended solution is to:

  1. Use the batching function with a fixed set of allowed batch sizes to restrict the number of compilations at run time (a sample batching config is sketched right after this list).

  2. Do all compilations for these allowed batch sizes at model load time instead of inference time (see the warmup sketch below). This way your model is ready for inference right after load, rather than going through high-latency compilation at inference time.
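
A minimal sketch of point 1, assuming the model server's --enable_batching and --batching_parameters_file flags; the batch sizes below are illustrative and should match what you warm up:

    # batching_parameters.txt, passed to the server with:
    #   tensorflow_model_server --enable_batching=true \
    #       --batching_parameters_file=/path/to/batching_parameters.txt ...
    max_batch_size { value: 32 }
    batch_timeout_micros { value: 5000 }
    max_enqueued_batches { value: 100 }
    num_batch_threads { value: 4 }
    # Only these batch sizes ever reach the model, so only these shapes
    # ever need to be compiled.
    allowed_batch_sizes: 8
    allowed_batch_sizes: 16
    allowed_batch_sizes: 32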

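A minimal sketch of point 2, assuming the SavedModel warmup mechanism (the warm-up protobuf mentioned in the comments): write one PredictionLog per allowed batch size into assets.extra/tf_serving_warmup_requests inside the SavedModel version directory. The server replays these requests at load time, which is when the expensive compilation then happens. The model name, signature name, input name, and shapes below are placeholders:

    import tensorflow as tf
    from tensorflow_serving.apis import model_pb2, predict_pb2, prediction_log_pb2

    # Placeholder path: <model dir>/<version>/assets.extra/tf_serving_warmup_requests
    WARMUP_FILE = "my_model/1/assets.extra/tf_serving_warmup_requests"

    with tf.io.TFRecordWriter(WARMUP_FILE) as writer:
        for batch_size in (8, 16, 32):  # keep in sync with allowed_batch_sizes
            request = predict_pb2.PredictRequest(
                model_spec=model_pb2.ModelSpec(name="my_model",
                                               signature_name="serving_default"),
                inputs={"input": tf.make_tensor_proto(
                    [[0.0] * 128] * batch_size)},  # placeholder input shape
            )
            log = prediction_log_pb2.PredictionLog(
                predict_log=prediction_log_pb2.PredictLog(request=request))
            writer.write(log.SerializeToString())
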
RakTheGeek