My question is about how one deploys a Hugging Face model. I recently downloaded the Falcon 7B Instruct model and ran it in Colab. However, when I load the model and ask it to generate text, it takes about 40 seconds to produce an output. I was wondering how these models are deployed in production so that they respond with low latency. I am new to MLOps and just want to explore. Also, what would it cost to deploy a model like this? And what happens if many users are querying the model simultaneously; how would I handle that? I would greatly appreciate any response.
The code I am using is from the model card at https://huggingface.co/tiiuae/falcon-7b-instruct.
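For reference, this is roughly what I am running, adapted from the example on the model card (my prompt and generation parameters may differ slightly):

```python
from transformers import AutoTokenizer
import transformers
import torch

model = "tiiuae/falcon-7b-instruct"

tokenizer = AutoTokenizer.from_pretrained(model)
# Falcon uses custom modeling code, hence trust_remote_code=True
pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

# A single generation call like this is what takes ~40 seconds for me
sequences = pipeline(
    "Write a short poem about the ocean.",
    max_length=200,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)
for seq in sequences:
    print(f"Result: {seq['generated_text']}")
```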
Also, I am saving the model weights locally to my Google Drive.
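This is a minimal sketch of how I am caching the weights, assuming a Drive path like /content/drive/MyDrive/falcon-7b-instruct (the actual folder name on my Drive may differ):

```python
import torch
from google.colab import drive
from transformers import AutoModelForCausalLM, AutoTokenizer

# Mount Google Drive inside the Colab runtime
drive.mount("/content/drive")

save_dir = "/content/drive/MyDrive/falcon-7b-instruct"  # hypothetical path

# First run: download from the Hub, then save a local copy to Drive
model = AutoModelForCausalLM.from_pretrained(
    "tiiuae/falcon-7b-instruct",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("tiiuae/falcon-7b-instruct")
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

# Later runs: load directly from Drive instead of re-downloading
model = AutoModelForCausalLM.from_pretrained(
    save_dir,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(save_dir)
```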