I'm using the Hugging Face Transformers GPT-2 XL model to generate multiple responses. I'm trying to run it across multiple GPUs because GPU memory maxes out when generating several longer responses. I've tried using DataParallel for this, but looking at nvidia-smi it doesn't appear that the second GPU is ever used. Here's my code:
import numpy as np
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel

n_gpu = torch.cuda.device_count()
# device = xm.xla_device()
device = torch.device("cuda:0")

# pre-trained model and tokenizer were downloaded earlier
tokenizer = GPT2Tokenizer.from_pretrained('/spell/GPT2Model/GPT2Model/')
model = GPT2LMHeadModel.from_pretrained('/spell/GPT2Model/GPT2Model/')
model.to(device)
model = torch.nn.DataParallel(model, device_ids=[0, 1])

# prompt_text, response_length, and num_of_responses are set earlier in the script
encoded_prompt = tokenizer.encode(prompt_text, add_special_tokens=True, return_tensors="pt")
encoded_prompt = encoded_prompt.to(device)

outputs = model.module.generate(
    encoded_prompt, max_length=response_length, temperature=0.8,
    num_return_sequences=num_of_responses, repetition_penalty=85,
    do_sample=True, top_k=80, top_p=0.85,
)
The program gets an OOM error on dual T4s, and the memory usage of the second GPU never goes above 11 MiB.
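For reference, a rough way to confirm the same thing from inside Python (assuming two visible CUDA devices) would be something like the sketch below; it only reports memory tracked by PyTorch's caching allocator, so it won't exactly match nvidia-smi, but it shows whether cuda:1 is being touched at all:

import torch

# print per-device memory as seen by PyTorch's caching allocator
for i in range(torch.cuda.device_count()):
    allocated = torch.cuda.memory_allocated(i) / 1024**2
    reserved = torch.cuda.memory_reserved(i) / 1024**2
    print(f"cuda:{i}: {allocated:.1f} MiB allocated, {reserved:.1f} MiB reserved")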