I noticed that removing this line:
```python
model = prepare_model_for_int8_training(model)
```
causes the model to easily produce a NaN loss when I load the model in 8-bit. Can someone explain why this function is necessary?
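For context, the setup I mean is roughly the following (the model id is just a placeholder, and the exact `from_pretrained` arguments depend on the transformers/peft versions):

```python
from transformers import AutoModelForCausalLM
from peft import prepare_model_for_int8_training

# Load the base LLaMA weights in 8-bit via bitsandbytes
model = AutoModelForCausalLM.from_pretrained(
    "decapoda-research/llama-7b-hf",  # placeholder model id
    load_in_8bit=True,
    device_map="auto",
)

# The call in question: removing this line is what makes the loss go NaN for me
model = prepare_model_for_int8_training(model)
```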
The purpose of this block is also hard for me to understand:
```python
if loaded_in_kbit and use_gradient_checkpointing:
    # For backward compatibility
    if hasattr(model, "enable_input_require_grads"):
        model.enable_input_require_grads()
    else:

        def make_inputs_require_grad(module, input, output):
            output.requires_grad_(True)

        model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)

    # enable gradient checkpointing for memory efficiency
    model.gradient_checkpointing_enable()
```
Is this code necessary for fine-tuning even if I do not want to change LLaMA's parameters at all, but only want to add a new trainable component on top?
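To be concrete, by "adding a new component" I mean something like the sketch below (`MyAdapter` is a made-up placeholder module, not part of transformers or peft):

```python
import torch
import torch.nn as nn

class MyAdapter(nn.Module):
    """Placeholder for the extra trainable component I want to add."""
    def __init__(self, hidden_size: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck)
        self.up = nn.Linear(bottleneck, hidden_size)

    def forward(self, hidden_states):
        # Residual bottleneck on top of the frozen model's hidden states
        return hidden_states + self.up(torch.relu(self.down(hidden_states)))

# Freeze every LLaMA parameter; only the new module should receive gradients
for param in model.parameters():
    param.requires_grad = False

adapter = MyAdapter(model.config.hidden_size).cuda()
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)
```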
I understand this may be a basic question for many, but I couldn't find a clear answer.
Finally, I found that calling prepare_model_for_int8_training consumes more CUDA memory, which can lead to out-of-memory errors.
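A quick way to see the difference, assuming the model is already loaded in 8-bit on the GPU, is to compare `torch.cuda.memory_allocated()` around the call:

```python
import torch
from peft import prepare_model_for_int8_training

before = torch.cuda.memory_allocated() / 1024**2
model = prepare_model_for_int8_training(model)
after = torch.cuda.memory_allocated() / 1024**2
print(f"allocated: {before:.0f} MiB -> {after:.0f} MiB")
```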