
I am trying to generate 20-token texts using gpt-2-simple. It takes me around 15 seconds to generate a sentence, while AI Dungeon takes around 4 seconds to generate a sentence of the same size.

Is there a way to speed up/optimize GPT-2 text generation?

A-Tech

3 Answers


I think they get quicker results because their program is better optimized and they have more computing power; they pay a lot for servers. Also, AI Dungeon uses GPT-3, which might simply be faster. I'm struggling with the speed of GPT-2 as well. Let me know if you figure anything out. Cheers.


Text generation models like GPT-2 are slow, and it is of course even worse with bigger models like GPT-J and GPT-NeoX.

If you want to speed up your text generation you have a couple of options:

  • Use a GPU. GPT-2 doesn't require too much VRAM, so an entry-level GPU will do. On a GPU, generating 20 tokens with GPT-2 shouldn't take more than 1 second (see the sketch after this list).
  • Quantize your model and convert it to TensorRT. See this good tutorial: https://github.com/NVIDIA/TensorRT/tree/main/demo/HuggingFace/GPT2
  • Serve it through a dedicated inference server (like TorchServe or Triton Inference Server).
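For reference, here is a minimal sketch of GPU generation. It uses the Hugging Face transformers library rather than gpt-2-simple (an assumption on my part, since transformers exposes device placement directly); the prompt and sampling settings are just placeholders.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Fall back to CPU if no GPU is available
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
model.eval()

inputs = tokenizer("The quick brown fox", return_tensors="pt").to(device)

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=20,  # 20 tokens, as in the question
        do_sample=True,
        top_k=50,
    )

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```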

I actually wrote an article about how to speed up inference of transformer-based models. You might find it helpful: how to speed up deep learning inference

Julien Salinas

You can use the OpenVINO-optimized version of the GPT-2 model. The demo can be found here. It should be much faster, as it's heavily optimized.
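For what it's worth, here is a minimal sketch of one common way to run GPT-2 through OpenVINO, using the optimum-intel package (an assumption on my part; the linked demo may take a different route):

```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
# export=True converts the PyTorch checkpoint to OpenVINO IR on the fly
model = OVModelForCausalLM.from_pretrained("gpt2", export=True)

inputs = tokenizer("The quick brown fox", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=20)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```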

dragon7