
I have a simple script that takes the opt-6.7b model and fine-tunes it. When I run this code in Google Colab (Tesla T4, 16GB) it runs without any problem, but when I run the same code on an AWS p3.2xlarge instance (Tesla V100 GPU, 16GB) it gives the following error:

RuntimeError: expected scalar type Half but found Float

To be able to run the fine-tuning on a single GPU I use LoRA and peft, which are installed exactly the same way (pip install) in both cases. I can wrap the training in with torch.autocast("cuda"): and then that error vanishes, but the training loss becomes very strange: it does not decrease gradually, it fluctuates within a large range (0-5), and if I switch the model to GPT-J the loss always stays at 0, whereas on Colab the loss decreases steadily. So I am not sure whether using with torch.autocast("cuda"): is a good idea or not.
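For reference, the workaround I tried looks roughly like this (just these two lines around the Trainer call; the full training setup is in the edit below):

with torch.autocast("cuda"):   # run the whole training loop under autocast
    trainer.train()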

The transformers version is 4.28.0.dev0 in both cases. The torch version on Colab shows 1.13.1+cu116, whereas on p3 it shows 1.13.1 (does this mean it does not have CUDA support? I doubt it; on top of that, torch.cuda.is_available() returns True).
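For reference, this is roughly how I check the environments on both machines (a sketch; the capability values come from the bitsandbytes logs below):

import torch
import transformers

print(transformers.__version__)             # 4.28.0.dev0 on both machines
print(torch.__version__)                    # 1.13.1+cu116 on Colab, 1.13.1 on p3
print(torch.version.cuda)                   # CUDA version the torch build was compiled against
print(torch.cuda.is_available())            # True on both
print(torch.cuda.get_device_capability(0))  # (7, 5) on the T4, (7, 0) on the V100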

The only big difference I can see is that on Colab, bitsandbytes prints the following setup log:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.5
CUDA SETUP: Detected CUDA version 118

Whereas on p3 it prints the following:

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: /opt/conda/envs/pytorch/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 7.0
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /opt/conda/envs/pytorch/lib/python3.9/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so...

What am I missing? I am not posting the full code here, but it is really a very basic script that takes opt-6.7b and fine-tunes it on the Alpaca dataset using LoRA and peft.

Why does it run on Colab but not on p3? Any help is welcome :)

-------------------- EDIT

I am posting a minimal code example of what I actually tried:

import os
os.environ["CUDA_VISIBLE_DEVICES"]="0"
import torch
import torch.nn as nn
import bitsandbytes as bnb
from transformers import AutoTokenizer, AutoConfig, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-6.7b", 
    load_in_8bit=True, 
    device_map='auto',
)

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-6.7b")
for param in model.parameters():
  param.requires_grad = False  # freeze the model - train adapters later
  if param.ndim == 1:
    # cast the small parameters (e.g. layernorm) to fp32 for stability
    param.data = param.data.to(torch.float32)

model.gradient_checkpointing_enable()  # reduce number of stored activations
model.enable_input_require_grads()

class CastOutputToFloat(nn.Sequential):
  def forward(self, x): return super().forward(x).to(torch.float32)
model.lm_head = CastOutputToFloat(model.lm_head)

def print_trainable_parameters(model):
    """
    Prints the number of trainable parameters in the model.
    """
    trainable_params = 0
    all_param = 0
    for _, param in model.named_parameters():
        all_param += param.numel()
        if param.requires_grad:
            trainable_params += param.numel()
    print(
        f"trainable params: {trainable_params} || all params: {all_param} || trainable%: {100 * trainable_params / all_param}"
    )

from peft import LoraConfig, get_peft_model 

config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM"
)

model = get_peft_model(model, config)
print_trainable_parameters(model)

import transformers
from datasets import load_dataset

tokenizer.pad_token_id = 0
CUTOFF_LEN = 256

data = load_dataset("tatsu-lab/alpaca")

data = data.shuffle().map(
    lambda data_point: tokenizer(
        data_point['text'],
        truncation=True,
        max_length=CUTOFF_LEN,
        padding="max_length",
    ),
    batched=True
)
# data = load_dataset("Abirate/english_quotes")
# data = data.map(lambda samples: tokenizer(samples['quote']), batched=True)

trainer = transformers.Trainer(
    model=model, 
    train_dataset=data['train'],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4, 
        gradient_accumulation_steps=4,
        warmup_steps=100, 
        max_steps=400, 
        learning_rate=2e-5, 
        fp16=True,
        logging_steps=1, 
        output_dir='outputs'
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

model.config.use_cache = False  # silence the warnings. Please re-enable for inference!
trainer.train()

And here is the full stack trace:

/tmp/ipykernel_24622/2601578793.py:2 in <module>                                                 │
│                                                                                                  │
│ [Errno 2] No such file or directory: '/tmp/ipykernel_24622/2601578793.py'                        │
│                                                                                                  │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/trainer.py:1639 in train        │
│                                                                                                  │
│   1636 │   │   inner_training_loop = find_executable_batch_size(                                 │
│   1637 │   │   │   self._inner_training_loop, self._train_batch_size, args.auto_find_batch_size  │
│   1638 │   │   )                                                                                 │
│ ❱ 1639 │   │   return inner_training_loop(                                                       │
│   1640 │   │   │   args=args,                                                                    │
│   1641 │   │   │   resume_from_checkpoint=resume_from_checkpoint,                                │
│   1642 │   │   │   trial=trial,                                                                  │
│                                                                                                  │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/trainer.py:1906 in              │
│ _inner_training_loop                                                                             │
│                                                                                                  │
│   1903 │   │   │   │   │   with model.no_sync():                                                 │
│   1904 │   │   │   │   │   │   tr_loss_step = self.training_step(model, inputs)                  │
│   1905 │   │   │   │   else:                                                                     │
│ ❱ 1906 │   │   │   │   │   tr_loss_step = self.training_step(model, inputs)                      │
│   1907 │   │   │   │                                                                             │
│   1908 │   │   │   │   if (                                                                      │
│   1909 │   │   │   │   │   args.logging_nan_inf_filter                                           │
│                                                                                                  │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/transformers/trainer.py:2662 in              │
│ training_step                                                                                    │
│                                                                                                  │
│   2659 │   │   │   loss = loss / self.args.gradient_accumulation_steps                           │
│   2660 │   │                                                                                     │
│   2661 │   │   if self.do_grad_scaling:                                                          │
│ ❱ 2662 │   │   │   self.scaler.scale(loss).backward()                                            │
│   2663 │   │   elif self.use_apex:                                                               │
│   2664 │   │   │   with amp.scale_loss(loss, self.optimizer) as scaled_loss:                     │
│   2665 │   │   │   │   scaled_loss.backward()                                                    │
│                                                                                                  │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/_tensor.py:488 in backward             │
│                                                                                                  │
│    485 │   │   │   │   create_graph=create_graph,                                                │
│    486 │   │   │   │   inputs=inputs,                                                            │
│    487 │   │   │   )                                                                             │
│ ❱  488 │   │   torch.autograd.backward(                                                          │
│    489 │   │   │   self, gradient, retain_graph, create_graph, inputs=inputs                     │
│    490 │   │   )                                                                                 │
│    491                                                                                           │
│                                                                                                  │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/autograd/__init__.py:197 in backward   │
│                                                                                                  │
│   194 │   # The reason we repeat same the comment below is that                                  │
│   195 │   # some Python versions print out the first line of a multi-line function               │
│   196 │   # calls in the traceback and some print out the last line                              │
│ ❱ 197 │   Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the bac   │
│   198 │   │   tensors, grad_tensors_, retain_graph, create_graph, inputs,                        │
│   199 │   │   allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to ru   │
│   200                                                                                            │
│                                                                                                  │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/autograd/function.py:267 in apply      │
│                                                                                                  │
│   264 │   │   │   │   │   │   │      "Function is not allowed. You should only implement one "   │
│   265 │   │   │   │   │   │   │      "of them.")                                                 │
│   266 │   │   user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn                    │
│ ❱ 267 │   │   return user_fn(self, *args)                                                        │
│   268 │                                                                                          │
│   269 │   def apply_jvp(self, *args):                                                            │
│   270 │   │   # _forward_cls is defined by derived class                                         │
│                                                                                                  │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/utils/checkpoint.py:157 in backward    │
│                                                                                                  │
│   154 │   │   │   raise RuntimeError(                                                            │
│   155 │   │   │   │   "none of output has requires_grad=True,"                                   │
│   156 │   │   │   │   " this checkpoint() is not necessary")                                     │
│ ❱ 157 │   │   torch.autograd.backward(outputs_with_grad, args_with_grad)                         │
│   158 │   │   grads = tuple(inp.grad if isinstance(inp, torch.Tensor) else None                  │
│   159 │   │   │   │   │     for inp in detached_inputs)                                          │
│   160                                                                                            │
│                                                                                                  │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/autograd/__init__.py:197 in backward   │
│                                                                                                  │
│   194 │   # The reason we repeat same the comment below is that                                  │
│   195 │   # some Python versions print out the first line of a multi-line function               │
│   196 │   # calls in the traceback and some print out the last line                              │
│ ❱ 197 │   Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the bac   │
│   198 │   │   tensors, grad_tensors_, retain_graph, create_graph, inputs,                        │
│   199 │   │   allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to ru   │
│   200                                                                                            │
│                                                                                                  │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/torch/autograd/function.py:267 in apply      │
│                                                                                                  │
│   264 │   │   │   │   │   │   │      "Function is not allowed. You should only implement one "   │
│   265 │   │   │   │   │   │   │      "of them.")                                                 │
│   266 │   │   user_fn = vjp_fn if vjp_fn is not Function.vjp else backward_fn                    │
│ ❱ 267 │   │   return user_fn(self, *args)                                                        │
│   268 │                                                                                          │
│   269 │   def apply_jvp(self, *args):                                                            │
│   270 │   │   # _forward_cls is defined by derived class                                         │
│                                                                                                  │
│ /opt/conda/envs/pytorch/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py:456 in   │
│ backward                                                                                         │
│                                                                                                  │
│   453 │   │   │                                                                                  │
│   454 │   │   │   elif state.CB is not None:                                                     │
│   455 │   │   │   │   CB = state.CB.to(ctx.dtype_A, copy=True).mul_(state.SCB.unsqueeze(1).mul   │
│ ❱ 456 │   │   │   │   grad_A = torch.matmul(grad_output, CB).view(ctx.grad_shape).to(ctx.dtype   │
│   457 │   │   │   elif state.CxB is not None:                                                    │
│   458 │   │   │   │                                                                              │
│   459 │   │   │   │   if state.tile_indices is None:

(Sorry if this is a very novice question, but I have no solution at the moment :( )

SRC
  • It is difficult for me to answer your question without seeing the code. The error message suggests that some of your code produces float32 tensors while opt-6.7b is a model with float16. Can you post the full error stack trace and provide a minimal reproducible example? – cronoik Apr 03 '23 at 11:12
  • @cronoik Thanks for the reply. I have posted a code example above. – SRC Apr 03 '23 at 13:01
  • For what it is worth, I believe compute capability 7.5+ is needed for this code to work. I managed to launch a T4 instance in GCP and install everything from scratch (including CUDA), used Anaconda to install cudatoolkit, then installed bitsandbytes and the rest, and it worked again. If anyone is interested, here is the list of compute capabilities: https://developer.nvidia.com/cuda-gpus – SRC Apr 11 '23 at 08:58

3 Answers


I have the same error. After searching on Google, I finally got a solution by adding with torch.autocast("cuda"): before my train call, like this:

with torch.autocast("cuda"):
    trainer.train()
chero

It could be because of mixed precision on the V100 GPU. You can try disabling fp16:

trainer = transformers.Trainer(
    model=model, 
    train_dataset=data['train'],
    args=transformers.TrainingArguments(
        per_device_train_batch_size=4, 
        gradient_accumulation_steps=4,
        warmup_steps=100, 
        max_steps=400, 
        learning_rate=2e-5, 
        fp16=False, # disable mixed precision
        logging_steps=1, 
        output_dir='outputs'
    ),
    data_collator=transformers.DataCollatorForLanguageModeling(tokenizer, mlm=False)
)

As noted in the comments, this increases GPU memory usage because mixed precision is disabled.
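If you want to see the memory impact of this change yourself, a quick check along these lines (a sketch, not part of the original answer) can be run around training:

import torch

torch.cuda.reset_peak_memory_stats()
trainer.train()
# peak GPU memory allocated during training, in GiB
print(torch.cuda.max_memory_allocated() / 1024**3)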

Barata Magnus

I have the same error as you: when I add the code with torch.autocast("cuda"): trainer.train(), the loss is 0. I suspect that bitsandbytes cannot support the V100 when using load_in_8bit=True and fp16=True.
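One way to see which hardware path bitsandbytes ends up on is to check the GPU's compute capability (a sketch; the 7.5 threshold matches the comment on the question and the _nocublaslt binary in the p3 setup log):

import torch

major, minor = torch.cuda.get_device_capability(0)
print(f"compute capability: {major}.{minor}")  # 7.0 on a V100, 7.5 on a T4
# below 7.5, bitsandbytes loads the *_nocublaslt fallback binary (see the setup log in the question)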

JDOaktown