
I noticed that removing this line:

model = prepare_model_for_int8_training(model)

causes the model to produce a "nan" loss almost immediately when I load the model in 8-bit. Can someone explain why this function is necessary?
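
For context, a minimal sketch of the kind of setup I mean (the checkpoint name and everything around the two relevant calls are just placeholders, not my exact script):

from transformers import AutoModelForCausalLM
from peft import prepare_model_for_int8_training

# Placeholder checkpoint; my actual model is a LLaMA variant loaded in 8-bit.
model = AutoModelForCausalLM.from_pretrained(
    "huggyllama/llama-7b",
    load_in_8bit=True,
    device_map="auto",
)

# Removing this line makes the training loss turn into "nan" almost immediately.
model = prepare_model_for_int8_training(model)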


Furthermore, the purpose of this block is also hard to understand:

if loaded_in_kbit and use_gradient_checkpointing:
    # For backward compatibility
    if hasattr(model, "enable_input_require_grads"):
        model.enable_input_require_grads()
    else:

        def make_inputs_require_grad(module, input, output):
            output.requires_grad_(True)

        model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)

    # enable gradient checkpointing for memory efficiency
    model.gradient_checkpointing_enable()

Is this code necessary for fine-tuning even if I do not want to change the parameters of LLaMA itself, but only add a new trainable component on top of it?
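
To make the question concrete, here is a rough sketch of what I mean by "adding a new component" (all names here are illustrative, not my actual code):

import torch.nn as nn

class LlamaWithHead(nn.Module):
    """Frozen 8-bit LLaMA backbone plus a small trainable head."""

    def __init__(self, base_model, hidden_size, num_labels):
        super().__init__()
        self.base_model = base_model
        for p in self.base_model.parameters():
            p.requires_grad = False            # LLaMA weights stay frozen
        self.head = nn.Linear(hidden_size, num_labels)  # the only trainable part

    def forward(self, input_ids, attention_mask=None):
        outputs = self.base_model(
            input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True,
        )
        last_hidden = outputs.hidden_states[-1]          # (batch, seq_len, hidden)
        # Cast to fp32 so the dtype matches the head's weights.
        return self.head(last_hidden[:, -1, :].float())  # predict from the last token

In this setup only the head receives gradient updates.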


I understand this may be a basic question for many, but I couldn't find a clear answer.

I also found that calling prepare_model_for_int8_training consumes more CUDA memory, which can lead to out-of-memory errors.
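
A sketch of how the difference can be checked, continuing from the loading snippet above (illustrative only):

import torch
from peft import prepare_model_for_int8_training

before = torch.cuda.memory_allocated() / 2**20
model = prepare_model_for_int8_training(model)
after = torch.cuda.memory_allocated() / 2**20
print(f"allocated: {before:.0f} MiB -> {after:.0f} MiB")

# Count how many parameters end up in fp32 after the call
# (I believe layer norms and similar small parameters are upcast for stability).
fp32_params = sum(p.numel() for p in model.parameters() if p.dtype == torch.float32)
print(f"fp32 parameters after preparation: {fp32_params:,}")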

ysngki