
Our use case is as follows: we have multiple custom-trained models (in the hundreds, and the number keeps growing as we allow users of our application to create models through the UI, which we then train and deploy on the fly), so deploying each model to a separate endpoint is expensive because Vertex AI charges per node used. Based on the documentation, it seems we can deploy models of different types to the same endpoint, but I am not sure how that would work. Let's say I have 2 different custom-trained models deployed to the same endpoint using custom containers for prediction, and I specify a 50-50 traffic split between the two models. Now, how do I send a request to a specific model? Using the Python SDK, we make calls to the endpoint like so:

from google.cloud import aiplatform
endpoint = aiplatform.Endpoint(endpoint_id)
prediction = endpoint.predict(instances=instances)

# where endpoint_id is the id of the endpoint and instances are the observations for which a prediction is required

My understanding is that, in this scenario, Vertex AI will route some calls to one model and some to the other based on the traffic split. I could use the parameters field, as specified in the docs, to identify the model and then process the request accordingly in the custom prediction container, but some calls would still end up at a model that cannot process them (because Vertex AI is not going to send every request to every model, otherwise the traffic split wouldn't make sense). How do I then deploy multiple models to the same endpoint and make sure that every prediction request is guaranteed to be served?
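
For reference, passing model information through the parameters field would look something like the sketch below (model_a and the model_name key are just illustrative; they only mean something if our custom container interprets them, since Vertex AI itself does not route on them):

from google.cloud import aiplatform

endpoint = aiplatform.Endpoint(endpoint_id)

# "model_name" is an arbitrary key that our own prediction container
# would have to read; Vertex AI does not use it for routing.
prediction = endpoint.predict(
    instances=instances,
    parameters={"model_name": "model_a"},
)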

racerX
  • What's your latency requirement? How many requests can you have? – guillaume blaquiere Nov 08 '21 at 08:24
  • It varies: for some it can be on the order of minutes, for others it is seconds, so we cannot just get away with batch prediction. The number of requests would ultimately vary from user to user. – racerX Nov 09 '21 at 03:04
  • How many models do you have? More than 100? And you want to serve them with Vertex AI custom model serving, correct? – guillaume blaquiere Nov 09 '21 at 08:10
  • Yes, more than 100 models, all to be served with custom model serving. – racerX Nov 11 '21 at 19:53
  • You can't; the quota limits you to 100 models in the same project. – guillaume blaquiere Nov 11 '21 at 22:31
  • Hi @guillaumeblaquiere, is this quota for all models or for custom-trained models only? Also, is it not possible to request an increase in the quota? Can you please share the source? – racerX Nov 15 '21 at 20:23
  • [Here](https://cloud.google.com/vertex-ai/quotas#serving) is the source. I don't know if it's a soft or hard limit, but we were never able to increase it at my previous company, and I wrote this Cloud Run packaging: https://medium.com/google-cloud/on-demand-small-batch-predictions-with-cloud-run-and-embedded-tf-469242d66c3b – guillaume blaquiere Nov 15 '21 at 20:52
  • Thanks, the Medium article is an interesting read. I was considering a workaround using the custom prediction container in Vertex AI itself; see the answer and my comment below it. Cloud Run has resource limitations, as you mentioned in the article, which Vertex AI does not. However, Cloud Run can scale down to 0 (Vertex AI cannot). Pricing seems to be higher for Cloud Run as well. – racerX Nov 15 '21 at 22:13
  • Sure, there are pros and cons. If your model is small, use Cloud Run. Use Vertex AI only for the biggest models that require a lot of resources. I think the Cloud Run team is also working on a Cloud Run with GPU; it could be a solution. – guillaume blaquiere Nov 16 '21 at 08:12

1 Answer


This documentation talks about a use case where 2 models are trained on the same feature set and share the incoming prediction traffic. As you have correctly understood, this does not apply to models trained on different feature sets, that is, to different models.

Unfortunately, deploying different models to the same endpoint utilizing only one node is not possible in Vertex AI at the moment. There is an ongoing feature request that is being worked on. However, we cannot provide an exact ETA on when that feature will be available.

I reproduced the multi-model setup and observed the following.

Traffic Splitting

I deployed 2 different models to the same endpoint and sent prediction requests to it. With a 50-50 traffic split, I saw errors implying that requests were being routed to the wrong model.
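
For context, a 50-50 two-model deployment like the one I tested can be set up roughly as follows with the Python SDK (the project, model IDs, and machine type below are placeholders, not the exact values I used):

from google.cloud import aiplatform

aiplatform.init(project="my-project", location="us-central1")  # placeholders

endpoint = aiplatform.Endpoint.create(display_name="multi-model-endpoint")

model_a = aiplatform.Model("MODEL_A_ID")  # placeholder model IDs
model_b = aiplatform.Model("MODEL_B_ID")

# The first deployment takes 100% of the traffic by default.
model_a.deploy(endpoint=endpoint, machine_type="n1-standard-4")

# Deploying the second model with traffic_percentage=50 rebalances the
# split so that each deployed model receives roughly half of the requests.
model_b.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-4",
    traffic_percentage=50,
)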

Cost Optimization

When multiple models are deployed to the same endpoint, they are deployed onto separate, independent nodes, so you will still be charged for each node used. Also, node autoscaling happens at the level of each deployed model, not at the endpoint level.
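
This is also visible in the SDK: replica (node) settings are arguments of each Model.deploy call, that is, per deployed model rather than per endpoint. A minimal sketch with placeholder values:

# Replica (node) counts are set per deployed model, so each model
# autoscales independently of the others on the same endpoint.
model_a.deploy(
    endpoint=endpoint,
    machine_type="n1-standard-4",
    min_replica_count=1,  # this deployment never scales below 1 node
    max_replica_count=3,  # ...and never above 3 nodes
)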

A plausible workaround would be to pack all your models into a single container and use custom HTTP server logic to send prediction requests to the appropriate model. This could be achieved using the parameters field of the prediction request body. The custom logic would look something like this:

import os

import numpy as np
from fastapi import FastAPI, Request

app = FastAPI()

# _preprocessor, _random_forest_model, _decision_tree_model and _class_names
# are assumed to be loaded when the container starts up.

@app.post(os.environ['AIP_PREDICT_ROUTE'])
async def predict(request: Request):
    body = await request.json()
    parameters = body["parameters"]
    instances = body["instances"]
    inputs = np.asarray(instances)
    preprocessed_inputs = _preprocessor.preprocess(inputs)

    # Route the request to the model named in the "parameters" field.
    if parameters["model_name"] == "random_forest":
        outputs = _random_forest_model.predict(preprocessed_inputs)
    else:
        outputs = _decision_tree_model.predict(preprocessed_inputs)

    return {"predictions": [_class_names[class_num] for class_num in outputs]}
Kabilan Mohanraj
  • Ok, so each model, whether of the same type or different, is deployed to its own independent node and scales independently; that would indeed be impractical to use. Yes, I have already been using something similar to what you mentioned, except that instead of storing the models in the container, I store them on Cloud Storage and deploy a single "dummy" model using a custom prediction container. When I need predictions, I pass the relevant model information in the parameters field. If I need a new model, I simply store it in Cloud Storage; that way, I don't have to rebuild the prediction container. – racerX Nov 09 '21 at 03:19
  • Hi @racerX. Thank you for sharing your workaround. If my answer addressed your question, please consider accepting it. If not, let me know so that I can improve the answer. – Kabilan Mohanraj Nov 10 '21 at 17:44
  • @racerX But I assume there will be a step to fetch the model from GCS when you pass a new model as a parameter. Would it be fetched once and then cached, or when would the download step happen? – OmaymaS Dec 01 '21 at 16:23
  • You could simply fetch the model artifacts from GCS every single time a prediction is required. Not sure if that causes performance issues, but if it does, you can copy the artifact to the container storage itself; the next time, you can first check whether the model is already available in the container. You could also cache all models in a dictionary in memory and only fetch (from GCS or container storage) if the model is not available there (see the sketch after these comments). – racerX Dec 01 '21 at 17:47
  • Yeah, this is exactly the case. I tried fetching it without caching and it caused performance issues. That's why I was asking. But maybe I'll need to consider caching or making the models available in the container from the very beginning. Thanks. – OmaymaS Dec 02 '21 at 12:09
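
A minimal sketch of the lazy-loading pattern described in the comments above, assuming model artifacts are stored as joblib files in a Cloud Storage bucket (the bucket name, object layout, and the get_model helper are illustrative, not part of the original setup):

import os

import joblib
from google.cloud import storage

# In-memory cache: model name -> loaded model object.
_model_cache = {}

# Placeholder bucket; replace with the bucket holding your model artifacts.
_BUCKET_NAME = os.environ.get("MODEL_BUCKET", "my-model-artifacts")

def get_model(model_name: str):
    """Return a cached model, downloading it from Cloud Storage on first use."""
    if model_name in _model_cache:
        return _model_cache[model_name]

    local_path = f"/tmp/{model_name}.joblib"
    if not os.path.exists(local_path):
        # Not on container storage yet, so fetch the artifact from GCS.
        client = storage.Client()
        blob = client.bucket(_BUCKET_NAME).blob(f"{model_name}/model.joblib")
        blob.download_to_filename(local_path)

    model = joblib.load(local_path)
    _model_cache[model_name] = model
    return model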