
My question is about how one deploys a Hugging Face model. I recently downloaded the Falcon 7B Instruct model and ran it in Colab. However, when I load the model and ask it to generate text, it takes about 40 seconds to produce an output. I was wondering how these models are deployed in production so that they respond with low latency. I am new to MLOps, so I just want to explore. Also, what would it cost to deploy that model? What if many users are using the model simultaneously? How would I handle that? I would greatly appreciate a response.

The code I am using is from https://huggingface.co/tiiuae/falcon-7b-instruct.

Also, I am saving the model weights locally in Google Drive.

Usman Afridi

1 Answer

  • I was just wondering how we deploy these models in production then so that it gives us output with low latency.

You can download the model and run it locally to avoid any latency caused by the network connection. Note that your input has to be processed, so it is normal for the model to take some time to respond. To make it as fast as possible, run it on a GPU (typically dozens of times faster than a CPU); see the sketch below.
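For reference, here is a minimal sketch along the lines of the snippet on the model card linked in the question (the prompt and generation parameters are just illustrative). `torch_dtype=torch.bfloat16` keeps the memory footprint down and `device_map="auto"` places the weights on a GPU when one is available:

```python
import torch
from transformers import AutoTokenizer, pipeline

model_id = "tiiuae/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load once; bfloat16 halves the memory footprint and
# device_map="auto" uses the GPU when one is available.
generator = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

outputs = generator(
    "Write a short poem about giraffes.",  # illustrative prompt
    max_new_tokens=100,
    do_sample=True,
    top_k=10,
    eos_token_id=tokenizer.eos_token_id,
)
print(outputs[0]["generated_text"])
```

Moving inference from CPU to GPU is the single biggest latency win here; the loading code itself barely changes.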

  • Also, what will be the charges of deploying that model?

The model itself is free to download and use. What you pay for is the hardware it runs on: a GPU instance from a cloud provider, or a managed inference endpoint, billed by usage or uptime.

  • What if many users are simultaneously using this model?

I have never run into issues of this kind myself, but if you want to avoid any problems, once again, you can download the model and load it locally. If many users need to share it, the usual approach is to load the model once and serve it behind an API, as in the sketch below.
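To give a concrete (hypothetical) picture of the multi-user case, here is a minimal sketch that puts one loaded copy of the model behind an HTTP API so every request shares the same weights. FastAPI, uvicorn, and the `/generate` endpoint name are my assumptions, not something from the original question:

```python
# server.py -- hypothetical minimal serving sketch
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import pipeline

app = FastAPI()

# Load the model once at startup; every request reuses it instead of
# reloading seven billion parameters per user.
generator = pipeline(
    "text-generation",
    model="tiiuae/falcon-7b-instruct",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

class Prompt(BaseModel):
    text: str
    max_new_tokens: int = 100

@app.post("/generate")
def generate(prompt: Prompt):
    out = generator(prompt.text, max_new_tokens=prompt.max_new_tokens)
    return {"generated_text": out[0]["generated_text"]}
```

You would start it with `uvicorn server:app`. Requests are effectively served one at a time here; for real concurrent traffic you would add batching or use a dedicated serving stack such as Hugging Face's text-generation-inference.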

SilentCloud