Our use case is as follows: we have hundreds of custom-trained models, and the number keeps growing because users of our application can create models through the UI, which we then train and deploy on the fly. Deploying each model to its own endpoint is expensive, since Vertex AI charges per node used. Based on the documentation, it seems we can deploy models of different types to the same endpoint, but I am not sure how that would work.

Let's say I have two different custom-trained models, each served with a custom container for prediction, deployed to the same endpoint, and I specify a 50/50 traffic split between them. How do I send a request to a specific model? Using the Python SDK, we call the endpoint like so:
from google.cloud import aiplatform

# endpoint_id is the ID of the endpoint; instances holds the observations
# for which predictions are required
endpoint = aiplatform.Endpoint(endpoint_id)
prediction = endpoint.predict(instances=instances)
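For reference, the deployment I have in mind looks roughly like the sketch below. The project, location, machine type, display name and model IDs are placeholders, and using traffic_percentage=50 on the second deployment is simply how I understand the 50/50 split would be configured:

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")

# One endpoint shared by both models (display name is a placeholder)
endpoint = aiplatform.Endpoint.create(display_name="shared-endpoint")

# First model initially receives all traffic
model_a = aiplatform.Model("MODEL_A_ID")
model_a.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-2",
    traffic_percentage=100,
)

# Second model takes 50% of the traffic; the rest stays with the first model
model_b = aiplatform.Model("MODEL_B_ID")
model_b.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-2",
    traffic_percentage=50,
)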
My understanding is that, in this scenario, Vertex AI will route some calls to one model and some to the other according to the traffic split. I could use the parameters field, as described in the docs, to name the intended model and then handle the request accordingly in the custom prediction container (see the sketch below). But some calls would still end up at a model that cannot process them, because Vertex AI does not send every request to every model; otherwise the traffic split would be meaningless. How, then, do I deploy multiple models to the same endpoint while making sure that every prediction request is guaranteed to be served?
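For context, the parameters-based routing I was considering looks something like the snippet below. The "model" key is purely my own convention, not anything Vertex AI interprets; the custom prediction container would have to read it and dispatch the request internally:

# Hypothetical: pass a model selector in the parameters field; the custom
# container would need to inspect it and route the request accordingly
prediction = endpoint.predict(
    instances=instances,
    parameters={"model": "model_a"},
)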