I'm using GPT-J locally on an Nvidia RTX 3090 GPU. Currently, I'm using the model in the following way:
import torch
import transformers
from transformers import GPTJForCausalLM

config = transformers.GPTJConfig.from_pretrained("EleutherAI/gpt-j-6B")
tokenizer = transformers.AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B", pad_token='<|endoftext|>', eos_token='<|endoftext|>', truncation_side='left')
model = GPTJForCausalLM.from_pretrained(
    "EleutherAI/gpt-j-6B",
    revision="float16",
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    use_cache=True,
)
model.to('cuda')
prompt = tokenizer(text, return_tensors='pt', truncation=True, max_length=2048)
prompt = {key: value.to('cuda') for key, value in prompt.items()}
out = model.generate(
    **prompt,
    num_return_sequences=1,
    min_length=16,
    max_new_tokens=75,
    do_sample=True,
    top_k=35,
    top_p=0.9,
    temperature=0.75,
    no_repeat_ngram_size=4,
    use_cache=True,
    pad_token_id=tokenizer.eos_token_id,
)
res = tokenizer.decode(out[0], clean_up_tokenization_spaces=True)
As input to the model I'm feeding 2048 tokens, and I generate 75 tokens as output. The latency is around 4-5 seconds. In a blog post I've read that latency can be improved by using pipelines, and that tokenization can be a bottleneck.
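To check where the time actually goes before changing anything, I'm planning to time tokenization and generation separately with a small helper (the helper name is my own; the commented usage refers to the `tokenizer`/`model` objects from the snippet above):

```python
import time

def timed(fn, *args, **kwargs):
    # Measure the wall-clock latency of a single call.
    # Returns (result, seconds elapsed).
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Intended usage against my code above (not run here):
#   _, tok_s = timed(tokenizer, text, return_tensors='pt',
#                    truncation=True, max_length=2048)
#   _, gen_s = timed(model.generate, **prompt, max_new_tokens=75)
# If tok_s is a tiny fraction of gen_s, tokenization is not the bottleneck.
```

My assumption is that generation dominates, but I'd like to confirm it rather than guess.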
Can the tokenization in my code be improved, and would using a pipeline reduce the latency? Is there anything else I can do to reduce the latency?