I have an endpoint in us-east that serves a custom imported model (Docker image).
This endpoint uses min replicas = 1 and max replicas = 100.
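As a sketch of the configuration in question, assuming the deployment goes through the Vertex AI Python SDK (`google-cloud-aiplatform`) and its `Model.deploy` method: `deploy_with_autoscaling` is a hypothetical helper, and the machine type is a placeholder, not the exact deployment code.

```python
# Replica bounds from the description above.
REPLICA_SETTINGS = {"min_replica_count": 1, "max_replica_count": 100}


def deploy_with_autoscaling(model, endpoint, machine_type):
    """Deploy `model` to `endpoint` with the replica bounds above.

    `model` and `endpoint` are expected to be google.cloud.aiplatform
    Model and Endpoint objects; `machine_type` is a placeholder for
    whatever machine type the deployment actually uses.
    """
    return model.deploy(
        endpoint=endpoint,
        machine_type=machine_type,
        **REPLICA_SETTINGS,
    )
```

With these bounds the deployed model should never drop below one replica, which is why the 1 -> 0 -> 2 transition is unexpected.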
Sometimes Vertex AI needs to scale the model from 1 to 2 replicas.
However, there seems to be an issue that causes the number of replicas to go 1 -> 0 -> 2 instead of 1 -> 2.
This causes several 504 (Gateway Timeout) errors in my API. The workaround was to set min replicas > 1, which significantly increases the monthly cost of the application.
Is this a known issue with Vertex AI/GCP services? Is there any way to fix it?