
I am getting "ValueError: You have to specify either input_ids or inputs_embeds" from a seemingly straightforward training example:

Iteration:   0%|          | 0/6694 [00:00<?, ?it/s]
Epoch:   0%|          | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "train_masked_lm.py", line 33, in <module>
    trainer.train()
  File "/home/zm/anaconda3/envs/electra/lib/python3.7/site-packages/transformers/trainer.py", line 503, in train
    tr_loss += self._training_step(model, inputs, optimizer)
  File "/home/zm/anaconda3/envs/electra/lib/python3.7/site-packages/transformers/trainer.py", line 629, in _training_step
    outputs = model(**inputs)
  File "/home/zm/anaconda3/envs/electra/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zm/anaconda3/envs/electra/lib/python3.7/site-packages/transformers/modeling_electra.py", line 639, in forward
    return_tuple,
  File "/home/zm/anaconda3/envs/electra/lib/python3.7/site-packages/torch/nn/modules/module.py", line 550, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/zm/anaconda3/envs/electra/lib/python3.7/site-packages/transformers/modeling_electra.py", line 349, in forward
    raise ValueError("You have to specify either input_ids or inputs_embeds")
ValueError: You have to specify either input_ids or inputs_embeds

My goal is to take a pre-trained model and train it a bit further on additional data. I'm new to transformers, so I must be doing something wrong. Please help!

I adapted https://huggingface.co/blog/how-to-train as follows:

from transformers import (
  ElectraForMaskedLM,
  ElectraTokenizer,
  Trainer,
  TrainingArguments,
  LineByLineTextDataset
)

model = ElectraForMaskedLM.from_pretrained('google/electra-base-generator')
tokenizer = ElectraTokenizer.from_pretrained('google/electra-base-generator')

def to_dataset(input_file):
  return LineByLineTextDataset(file_path=input_file, tokenizer=tokenizer, block_size=128)


training_args = TrainingArguments(
  output_dir='./output',
  overwrite_output_dir=True,
  num_train_epochs=3,
  per_device_train_batch_size=64,
  per_device_eval_batch_size=64,
  save_steps=10000,
  warmup_steps=500,
  logging_dir='./logs',
)

trainer = Trainer(
  model=model,
  args=training_args,
  train_dataset=to_dataset('...../lines.txt'), # \n-separated lines of text (sentences)
)
trainer.train()

The aforementioned error fires off a few seconds after the script starts and is the very first output.
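
One possible cause, offered as an assumption rather than a confirmed diagnosis: the blog post linked above passes a `DataCollatorForLanguageModeling` to the `Trainer`, and the adapted script leaves it out. Without it, the default collator may produce a batch that has no `input_ids` key, which would explain why `model(**inputs)` raises this error. Below is a minimal sketch of the same setup with the collator restored. The helper name `build_trainer` and its parameters are hypothetical, and the `mlm_probability=0.15` value is only a common choice for masked-LM training, not something taken from the script.

```python
def build_trainer(train_file, model_name="google/electra-base-generator"):
    """Build a Trainer whose batches carry input_ids and masked-LM labels.

    Hypothetical helper for illustration; train_file is a text file with
    one sentence per line.
    """
    # Imported here so that defining the helper does not load the library.
    from transformers import (
        DataCollatorForLanguageModeling,
        ElectraForMaskedLM,
        ElectraTokenizer,
        LineByLineTextDataset,
        Trainer,
        TrainingArguments,
    )

    model = ElectraForMaskedLM.from_pretrained(model_name)
    tokenizer = ElectraTokenizer.from_pretrained(model_name)

    dataset = LineByLineTextDataset(
        tokenizer=tokenizer, file_path=train_file, block_size=128
    )

    # Assumption: this is the piece missing from the script above.
    # The collator pads each batch, masks random tokens, and returns
    # the masked input_ids together with the labels to predict.
    data_collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=0.15
    )

    training_args = TrainingArguments(
        output_dir="./output",
        overwrite_output_dir=True,
        num_train_epochs=3,
        per_device_train_batch_size=64,
        save_steps=10000,
        warmup_steps=500,
    )

    return Trainer(
        model=model,
        args=training_args,
        data_collator=data_collator,
        train_dataset=dataset,
    )
```

Calling `build_trainer(path_to_lines_file).train()` would then start training. Whether this resolves the error depends on the installed transformers version, so it is worth printing one collated batch first to check that it contains `input_ids`.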

Yevgeniy
  • Can you post some lines from `lines.txt` please? Can you also include the full stacktrace of your error message? – cronoik Aug 04 '20 at 07:04
  • Yes, updated the post for both. Training data is just a bunch of regular English text lines. – Yevgeniy Aug 04 '20 at 20:51
