Currently, I am working with a PyTorch model locally using the following code:
from transformers import pipeline

# device is a GPU index (e.g. 0) or -1 for CPU
classify_model = pipeline("zero-shot-classification", model='models/zero_shot_4.7.0', device=device)
result = classify_model(text, [label], hypothesis_template=hypothesis)
# the pipeline returns a dict, so the score has to be read by key
score = result["scores"][0]
I have decided to try deploying this model with TorchServe on Vertex AI, following Google's documentation, but I have some concerns.

For example, the MAR archive essentially just contains my model and tokenizer, and it is unpacked every time the container starts, creating a new folder each time and taking up more disk space. By default, TorchServe also starts 5 workers, each of which loads the 2 GB model into memory, for a total of 10 GB of RAM. I haven't dug too deeply into this yet, but I believe load balancing is the responsibility of Vertex AI; please correct me if I'm wrong.

Would it be better to build a simple Flask + PyTorch + Transformers container based on an NVIDIA/CUDA image and use that in production, or should I still use TorchServe? In the future, the system should scale automatically and have the tools to handle high load. Perhaps there are other approaches in my case that don't involve a container at all.
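On the worker count: my current understanding is that it can be capped through TorchServe's config.properties (baked into the image or passed with --ts-config at startup). A minimal sketch, with placeholder paths and addresses:

# config.properties sketch; addresses and model_store path are placeholders
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
model_store=/home/model-server/model-store
load_models=all
# limit each model to a single worker so only one 2 GB copy sits in RAM
default_workers_per_model=1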
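To make the Flask question concrete, this is roughly the kind of service I have in mind; it is only a sketch assuming a single process, and the /predict route, port, and request fields are my own placeholders rather than anything Vertex AI requires:

# minimal Flask + Transformers inference service (sketch, placeholder route/port)
import torch
from flask import Flask, request, jsonify
from transformers import pipeline

app = Flask(__name__)
device = 0 if torch.cuda.is_available() else -1
classify_model = pipeline("zero-shot-classification", model='models/zero_shot_4.7.0', device=device)

@app.route("/predict", methods=["POST"])
def predict():
    payload = request.get_json()
    result = classify_model(
        payload["text"],
        payload["labels"],
        hypothesis_template=payload.get("hypothesis", "This example is {}."),
    )
    return jsonify({"labels": result["labels"], "scores": result["scores"]})

if __name__ == "__main__":
    # in production this would sit behind gunicorn or similar, not the dev server
    app.run(host="0.0.0.0", port=8080)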