Is there a suggested way to serve hundreds of machine learning models in Kubernetes? Solutions like KFServing seem better suited to cases where a single trained model, or a few versions of it, serves all requests, for instance a typeahead model that is universal across all users.
But is there a suggested way to serve hundreds or thousands of such models, for example a typeahead model trained specifically on each user's data?
The most naive way to achieve this would be for each typeahead serving container to maintain a local in-memory cache of models. But scaling to multiple pods would then be a problem, because each cache is local to its pod, so each request would need to be routed to the pod that has the right model loaded.
Also, maintaining a registry that records which pod has loaded which model, and updating it whenever a model is evicted, seems like a lot of work.
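To make the bookkeeping concrete, here is a minimal sketch of that naive approach. Everything in it is hypothetical and for illustration only: `ModelRegistry` stands in for whatever shared store would actually hold the model-to-pod mapping (here it is just an in-process dict), `PodModelCache` is a per-pod LRU cache, and `loader` is a placeholder for the code that would fetch a user's model from storage.

```python
from collections import OrderedDict
from typing import Any, Callable, Dict, Optional


class ModelRegistry:
    """Hypothetical cluster-wide map of model_id -> pod name.

    Assumption: in a real deployment this would live in a shared store;
    a plain dict is used here only to show the required updates.
    """

    def __init__(self) -> None:
        self._owner: Dict[str, str] = {}

    def register(self, model_id: str, pod: str) -> None:
        self._owner[model_id] = pod

    def unregister(self, model_id: str, pod: str) -> None:
        # Only remove the entry if this pod is still the recorded owner.
        if self._owner.get(model_id) == pod:
            del self._owner[model_id]

    def lookup(self, model_id: str) -> Optional[str]:
        return self._owner.get(model_id)


class PodModelCache:
    """Per-pod LRU cache of loaded models that keeps the registry in sync."""

    def __init__(self, pod: str, capacity: int, registry: ModelRegistry,
                 loader: Callable[[str], Any]) -> None:
        self._pod = pod
        self._capacity = capacity
        self._registry = registry
        self._loader = loader  # hypothetical: loads a model by id
        self._models: "OrderedDict[str, Any]" = OrderedDict()

    def get(self, model_id: str) -> Any:
        if model_id in self._models:
            # Cache hit: mark as most recently used.
            self._models.move_to_end(model_id)
            return self._models[model_id]
        # Cache miss: load the model and announce it in the registry.
        model = self._loader(model_id)
        self._models[model_id] = model
        self._registry.register(model_id, self._pod)
        if len(self._models) > self._capacity:
            # Evict the least recently used model and update the registry.
            evicted_id, _ = self._models.popitem(last=False)
            self._registry.unregister(evicted_id, self._pod)
        return model


registry = ModelRegistry()
cache = PodModelCache("pod-a", capacity=2, registry=registry,
                      loader=lambda model_id: f"model-for-{model_id}")
for user in ("user-1", "user-2", "user-3"):
    cache.get(user)
# A router would call registry.lookup(model_id) to pick the target pod;
# "user-1" has been evicted by now, so its lookup finds no owner.
```

Even in this toy form, every load and every eviction has to touch the registry, and a router has to consult it on each request, which is the overhead described above.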