I fine-tuned the Llama-7b-hf model on a downstream task, using the eos_token as the pad_token. When evaluating with model.generate(), an error occurs at the 5th batch (the first 4 batches run without any trouble). I printed the shapes and CUDA devices of the two tensors involved (line 93 of modeling_llama.py, in class LlamaRMSNorm(nn.Module): return self.weight * hidden_states), and also printed the input_ids/attention_mask fed into the model; everything looks fine. [screenshot of the error]
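For reference, here is a minimal sketch of how I set the pad token and run batched generation (the model path, prompts, and generation parameters are placeholders, not my exact code):

```python
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer

model_path = "path/to/finetuned-llama-7b-hf"  # placeholder path to my fine-tuned checkpoint
tokenizer = LlamaTokenizer.from_pretrained(model_path)
tokenizer.pad_token = tokenizer.eos_token  # use eos_token as pad_token

model = LlamaForCausalLM.from_pretrained(model_path, torch_dtype=torch.float16).cuda()
model.eval()

# One evaluation batch of size 4 (placeholder prompts)
prompts = ["example prompt 1", "example prompt 2", "example prompt 3", "example prompt 4"]
inputs = tokenizer(prompts, return_tensors="pt", padding=True).to(model.device)

with torch.no_grad():
    outputs = model.generate(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"],
        max_new_tokens=128,
        pad_token_id=tokenizer.eos_token_id,
    )
print(tokenizer.batch_decode(outputs, skip_special_tokens=True))
```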
When I change the batch size from 4 to 1, generation runs normally (with batch size 1 no pad_token_id is fed to the model). However, nothing goes wrong during fine-tuning with batch size 4, and pad_token_id does appear in the input_ids of the first 4 evaluation batches. Other LLMs such as Bloom and ChatGLM do not run into this problem.