I am experimenting with deploying TF Serving on GKE to build a highly available online prediction system. I tried to optimize latency by batching multiple requests together, but the latency seems to suffer rather than improve.
- The model is a CNN with an input vector of length around 50.
- TF Serving runs on a Kubernetes cluster with 6 standard nodes.
- I tried batches of size 5 and 10. I didn't use the batching implementation built into TF Serving; I simply sent a request with an array of shape (batch_size, input_size) instead of (1, input_size). A rough sketch of such a request is below this list.
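This is roughly what the manual batching looks like, assuming the REST predict API is used (the host, port, and model name below are placeholders, and the input length of 50 matches the model described above):

```python
import numpy as np
import requests

SERVER = "http://tf-serving.example.com:8501"  # placeholder service address
MODEL = "my_cnn"                               # placeholder model name
INPUT_SIZE = 50

def predict(batch_size):
    # Shape (batch_size, INPUT_SIZE) instead of (1, INPUT_SIZE):
    # all examples go to a single TF Serving replica in one request.
    instances = np.random.rand(batch_size, INPUT_SIZE).tolist()
    resp = requests.post(
        f"{SERVER}/v1/models/{MODEL}:predict",
        json={"instances": instances},
    )
    resp.raise_for_status()
    return resp.json()["predictions"]

print(len(predict(5)))   # 5 predictions from one request
print(len(predict(10)))  # 10 predictions from one request
```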
My intuition was that even though batching brings the most benefit when used with GPUs (to exploit their throughput), using it with CPUs shouldn't make things slower. The slowdown is illustrated in the charts below. Note that req/s really means predictions/s, i.e. 20 predictions/s would be split into either 4 or 2 requests to the server, depending on the batch size.
I understand that this doesn't spread the workload evenly over the cluster at lower request rates, but even when looking at 60 or 120 predictions/s the latency is simply higher.
Any idea why that is the case?