I'm using GPT-J locally on an Nvidia RTX 3090 GPU. Currently, I'm using the model in the following way:
import torch
import transformers
from transformers import GPTJForCausalLM

config = transformers.GPTJConfig.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B", pad_token='<|endoftext|>', eos_token='<|endoftext|>', truncation_side='left')
model = GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    revision="float16",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    use_cache=True,
)
model.to('cuda')
prompt = tokenizer(text, return_tensors='pt', truncation=True, max_length=2048)
prompt = {key: value.to('cuda') for key, value in prompt.items()}
out = model.generate(
    **prompt,
    num_return_sequences=1,
    min_length=16,
    max_new_tokens=75,
    do_sample=True,
    top_k=35,
    top_p=0.9,
    temperature=0.75,
    no_repeat_ngram_size=4,
    use_cache=True,
    pad_token_id=tokenizer.eos_token_id,
)
res = tokenizer.decode(out[0], clean_up_tokenization_spaces=True)
As input to the model I'm feeding 2048 tokens, and I generate 75 tokens as output. The latency is around 4-5 seconds. In a blog post I've read that latency can be improved by using pipelines, and that tokenization can be a bottleneck.
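To check where the time actually goes before changing anything, I'm planning to time tokenization and generation separately with a small helper (the helper name is my own; the commented usage refers to the `tokenizer`/`model` objects from the snippet above):

```python
import time

def timed(fn, *args, **kwargs):
    # Measure the wall-clock latency of a single call.
    # Returns (result, seconds elapsed).
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Intended usage against my code above (not run here):
#   _, tok_s = timed(tokenizer, text, return_tensors='pt',
#                    truncation=True, max_length=2048)
#   _, gen_s = timed(model.generate, **prompt, max_new_tokens=75)
# If tok_s is a tiny fraction of gen_s, tokenization is not the bottleneck.
```

My assumption is that generation dominates, but I'd like to confirm it rather than guess.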
Can the tokenization in my code be improved, and would using a pipeline reduce the latency? Is there anything else I can do to reduce the latency?