
I'm trying to run the Llama 2 13B model with RoPE scaling on an AWS g4dn.12xlarge instance, which has 4 GPUs with 16 GB of VRAM each, but I'm getting a CUDA out-of-memory error.

Code:

from transformers import AutoModelForCausalLM, AutoTokenizer
import transformers

tokenizer = AutoTokenizer.from_pretrained("TheBloke/Llama-2-13B-Chat-fp16", use_fast=False)

# Shard the model across the 4 GPUs and apply 2x linear RoPE scaling (4k -> 8k context).
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-Chat-fp16",
    device_map="auto",
    rope_scaling={"factor": 2.0, "type": "linear"},
)

user_prompt = "..."

pipeline = transformers.pipeline(
    "text-generation",
    model=model,
    tokenizer=tokenizer,
    device_map="auto",
)


# Note: max_length counts the prompt tokens plus the generated tokens.
sequences = pipeline(
    user_prompt,
    max_length=8000,
    do_sample=True,
    top_k=10,
    num_return_sequences=1,
    eos_token_id=tokenizer.eos_token_id,
)

print(sequences)

This is the error I get when the prompt is longer than 4k tokens:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 4.51 GiB (GPU 0; 14.61 GiB total capacity; 11.92 GiB already allocated; 1.76 GiB free; 12.14 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Is 64 GB not enough to run the model with an 8k context, or is there a bug in my code?
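
For reference, this is the rough back-of-envelope estimate I'm going by (a sketch only: the layer count and hidden size are the published Llama-2-13B config values, and it ignores activations, the CUDA context and allocator overhead):

# Back-of-envelope memory estimate for Llama-2-13B at 8k context in fp16.
n_params = 13e9
n_layers, hidden_size = 40, 5120      # Llama-2-13B config
seq_len, bytes_per_val = 8000, 2      # fp16

weights_gb = n_params * bytes_per_val / 1e9                               # ~26 GB
kv_cache_gb = 2 * n_layers * seq_len * hidden_size * bytes_per_val / 1e9  # ~6.6 GB

print(f"weights ~{weights_gb:.0f} GB, kv cache ~{kv_cache_gb:.1f} GB per sequence")

That comes out well under 64 GB in total, but each 16 GB card still has to hold its shard of the weights plus its share of the KV cache; printing model.hf_device_map after loading shows how accelerate actually split the layers.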

Well, I guess the problem is that you have 4 separate 16 GB GPUs and not 64 GB of joint GPU memory. The model and data stored on each GPU seem to exceed 16 GB. – i regular Aug 10 '23 at 06:56
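
If that diagnosis is right, one thing worth ruling out (a sketch, not verified on this setup) is that from_pretrained loads the weights in fp32 unless torch_dtype is given, which alone would be roughly 52 GB for a 13B model. Forcing fp16 and capping what each card may hold via max_memory, both standard from_pretrained arguments when device_map="auto" is used, would look roughly like this (the 13GiB cap is an illustrative value, not a recommendation):

import torch
from transformers import AutoModelForCausalLM

# Sketch: load the weights in fp16 explicitly (the default is fp32) and leave
# headroom on every 16 GB card for the KV cache and activations.
model = AutoModelForCausalLM.from_pretrained(
    "TheBloke/Llama-2-13B-Chat-fp16",
    device_map="auto",
    torch_dtype=torch.float16,
    max_memory={i: "13GiB" for i in range(4)},  # hypothetical per-GPU cap
    rope_scaling={"factor": 2.0, "type": "linear"},
)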

0 Answers