I have a small web server that receives sentences as input and needs to return a model prediction using TensorFlow Serving. It works fine on our single GPU, but now I'd like to enable batching so that TensorFlow Serving waits briefly to group incoming sentences before processing them together in one batch on the GPU.
I'm using the predesigned server framework with the predesigned batching framework from the initial release of TensorFlow Serving. I'm enabling batching with the --batching flag and have set batch_timeout_micros = 10000 and max_batch_size = 1000. The logging does confirm that batching is enabled and that the GPU is being used.
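For concreteness, the server is launched roughly as below. The model name and path are placeholders, and I'm not certain the timeout and batch-size parameters are spelled exactly like this as command-line flags in my build of the initial release, but the values are the ones in effect:

```
# Illustrative launch command; only the batching-related flags and values
# are the ones I'm actually using, the rest are placeholders.
bazel-bin/tensorflow_serving/model_servers/tensorflow_model_server \
    --port=9000 \
    --model_name=sentence_model \
    --model_base_path=/serving/models/sentence_model \
    --batching \
    --batch_timeout_micros=10000 \
    --max_batch_size=1000
```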
However, when I send requests to the serving server, batching has minimal effect: sending 50 requests at the same time takes almost 10 times as long as sending 5 requests, i.e. the total time scales roughly linearly with the number of requests. Interestingly, the server's predict() function is run once for each request (see here), which suggests to me that the batching is not being handled properly.
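In case it matters, this is roughly how I'm generating the concurrent load. It's a minimal sketch: the model name "sentence_model" and input name "sentences" are placeholders for my actual signature, and the stub below uses the current tensorflow_serving.apis gRPC interface, which may differ slightly from what the initial release ships:

```python
# Sketch of the load test: fire N Predict requests concurrently and time them.
import time
from concurrent.futures import ThreadPoolExecutor

import grpc
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:9000")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

def predict(sentence):
    request = predict_pb2.PredictRequest()
    request.model_spec.name = "sentence_model"           # placeholder model name
    request.inputs["sentences"].CopyFrom(                # placeholder input name
        tf.make_tensor_proto([sentence], shape=[1]))
    return stub.Predict(request, 10.0)                   # 10 s deadline

def timed_run(n):
    sentences = ["this is a test sentence"] * n
    start = time.time()
    with ThreadPoolExecutor(max_workers=n) as pool:
        list(pool.map(predict, sentences))               # wait for all replies
    return time.time() - start

# If server-side batching kicked in, 50 concurrent requests should take far
# less than 10x the time of 5 concurrent requests; in my case it scales
# almost linearly.
print("5 requests:  %.3f s" % timed_run(5))
print("50 requests: %.3f s" % timed_run(50))
```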
Am I missing something? How do I check what's wrong with the batching?
Note that this is different from How to do batching in Tensorflow Serving?, as that question only examines how to send multiple requests from a single client, not how to enable TensorFlow Serving's behind-the-scenes batching of multiple separate requests.