Vertex AI Endpoints scales to 0 before increasing number of replicas

Question

I have an endpoint in us-east which serves a custom imported model (docker image).

This endpoint uses min replicas = 1 and max replicas = 100.

Sometimes, Vertex AI will require the model to scale from 1 to 2.

However, there seems to be an issue causing the number of replicas to go from 1 -> 0 -> 2 instead of 1 -> 2.

This causes several 504 (Gateway Timeout) errors in my API. The only workaround I found was setting min replicas > 1, which significantly increases the monthly cost of the application.

Is this a known issue with Vertex AI/GCP services, and is there any way to fix it?

Kabilan Mohanraj · Accepted Answer · 2022-01-03T10:11:36.607

The intermittent 504 errors could be a result of an endpoint that is under-provisioned to handle the load. It can also happen if too many prediction requests are sent to the endpoint before it has a chance to scale up.

Traffic splitting of the incoming prediction requests is done randomly, so multiple requests may end up on the same model server at the same time. This can happen even if the overall Queries Per Second (QPS) is low, and especially when the QPS is spiky. If the model server cannot handle the load, the requests queue up, and requests that wait too long result in a 504 error.
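To illustrate why collisions occur even at low QPS, here is a minimal sketch that assumes each request is routed independently and uniformly at random to one replica. That is a simplification for illustration, not a description of Vertex AI's actual routing, and the function name is hypothetical.

```python
def collision_probability(num_requests: int, num_replicas: int) -> float:
    """Probability that at least two of `num_requests` concurrent requests
    land on the same replica, assuming independent uniform random routing
    (an illustrative assumption, not Vertex AI's documented behavior)."""
    if num_requests > num_replicas:
        # More requests than replicas: a collision is certain.
        return 1.0
    p_all_distinct = 1.0
    for i in range(num_requests):
        # The (i+1)-th request must avoid the i replicas already taken.
        p_all_distinct *= (num_replicas - i) / num_replicas
    return 1.0 - p_all_distinct


if __name__ == "__main__":
    for replicas in (2, 5, 10):
        p = collision_probability(3, replicas)
        print(f"3 concurrent requests, {replicas} replicas: {p:.2f}")
```

Under this assumption, a handful of simultaneous requests is enough to make it likely that one replica receives more than one of them, which matters most when the model server handles requests one at a time.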

Recommendations to mitigate the 504 errors are as follows:

Improve the container's ability to use all of the resources available to it. For resource utilization, check whether the model server is single-threaded or multi-threaded. A single-threaded server may not use all the cores, so requests queue up and are served one at a time.
Autoscaling is happening; it may just need to be tuned to the prediction workload and expectations. A lower utilization threshold triggers autoscaling sooner.
Retry failed requests with exponential backoff while the deployment is scaling, so that requests which fail during scale-up get another chance.
Provision a higher minimum replica count for the endpoint, which you have already implemented.
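The retry recommendation can be sketched as follows. This is a minimal, generic example of exponential backoff with jitter, not code from the Vertex AI SDK: `TransientError`, `call_with_backoff`, and the default delays are hypothetical names and values chosen for illustration, and the caller is assumed to translate a 504 response into `TransientError`.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for retryable failures, such as a 504 response."""


def call_with_backoff(
    func: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 32.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `func`, retrying on TransientError with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return func()
        except TransientError:
            if attempt == max_attempts - 1:
                # Out of attempts: surface the last error to the caller.
                raise
            # The delay ceiling doubles on each attempt, capped at max_delay.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            # Full jitter: sleep a random time up to the ceiling so that
            # many clients do not retry in lockstep.
            sleep(random.uniform(0, ceiling))
    raise ValueError("max_attempts must be at least 1")
```

In use, `func` would wrap the prediction request and raise `TransientError` when the response status is 504. The attempt count and delays should be tuned to how long the endpoint typically takes to add a replica.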

If the above recommendations do not solve the problem, or if these errors need further investigation, please reach out to GCP support if you have a support plan. Otherwise, please open an issue in the issue tracker.

The issue was fixed by setting up retries with an exponential backoff strategy, as you suggested. Thank you very much for your answer. I'll set it as the accepted answer for future adventurers having this issue. — Victor Maricato, Jan 04 '22 at 12:47
Though it would be nice if the error could be tracked from the Vertex side rather than the 504 client/appengine side — jonincanada, Jan 11 '23 at 15:36
