I am working on multiple-choice QA using the official huggingface/transformers notebook, which is written for the SWAG dataset.
I want to use it with other multiple-choice datasets, so I modified the dataset-related parts. All of the code is in the notebook.
The SWAG dataset contains the following columns, including 'label':
train: Dataset({
    features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
    num_rows: 73546
})
The dataset I want to use has the following columns, with 'answerKey' as the target:
train: Dataset({
    features: ['id', 'question_stem', 'choices', 'answerKey'],
    num_rows: 4957
})
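Because the collator shown further down only looks for a 'label' or 'labels' key, one thing that may be worth trying is to derive an integer 'label' column from 'answerKey' before tokenizing. The sketch below assumes the answer keys are the letters 'A' to 'D', which I have not verified for every row; `KEY_TO_INDEX` and `add_label` are hypothetical names and are not part of the notebook.

```python
# Hypothetical mapping: assumes answerKey is always one of "A", "B", "C", "D".
KEY_TO_INDEX = {"A": 0, "B": 1, "C": 2, "D": 3}


def add_label(example):
    """Return a copy of the example with an integer 'label' field added."""
    key = example["answerKey"]
    if key not in KEY_TO_INDEX:
        # Fail loudly instead of silently producing a wrong class index.
        raise ValueError(f"Unexpected answerKey: {key!r}")
    return {**example, "label": KEY_TO_INDEX[key]}


# Placeholder row with the same column names as the dataset above.
example = {"id": "q1", "question_stem": "placeholder", "choices": {}, "answerKey": "C"}
print(add_label(example)["label"])
```

With the datasets library this would presumably be applied as `dataset.map(add_label)`, so that every split gains a 'label' column alongside 'answerKey'.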
The error is raised in the data collator, which is defined as follows:
from dataclasses import dataclass
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy
from typing import Optional, Union
import torch

@dataclass
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        print(features[0].keys())
        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        # Flatten: one dict per (example, choice) pair so the tokenizer can pad them together
        flattened_features = [[{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features]
        flattened_features = sum(flattened_features, [])

        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )

        # Un-flatten
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # Add back labels
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch
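To make the flattening step in the collator easier to follow, here is a small pure-Python sketch of what it does to a toy batch. The token ids are made up for illustration, and a list comprehension is used in place of `sum(..., [])`; the result should be the same list.

```python
# Toy batch: 2 examples, each with 2 choices (token ids are made up).
features = [
    {"input_ids": [[1, 2], [3]], "attention_mask": [[1, 1], [1]]},
    {"input_ids": [[4], [5, 6]], "attention_mask": [[1], [1, 1]]},
]

num_choices = len(features[0]["input_ids"])

# One dict per (example, choice) pair, grouped by example.
nested = [
    [{k: v[i] for k, v in feature.items()} for i in range(num_choices)]
    for feature in features
]

# Flatten the groups into a single list of batch_size * num_choices dicts.
flattened = [item for group in nested for item in group]

print(len(flattened))  # 4
print(flattened[1])    # {'input_ids': [3], 'attention_mask': [1]}
```

Note that every key still present in a feature at this point is indexed with `v[i]`, so any extra column left in the features would also need to be a per-choice list.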
The error is raised at the second of the following lines:
    label_name = "label" if "label" in features[0].keys() else "labels"
    labels = [feature.pop(label_name) for feature in features]
It occurs when calling trainer.train():
KeyError Traceback (most recent call last)
<ipython-input-64-3435b262f1ae> in <module>()
----> 1 trainer.train()
5 frames
<ipython-input-60-d1262e974b03> in <listcomp>(.0)
18 print(features[0].keys())
19 label_name = "label" if "label" in features[0].keys() else "labels"
---> 20 labels = [feature.pop(label_name) for feature in features]
21 batch_size = len(features)
22 num_choices = len(features[0]["input_ids"])
KeyError: 'labels'
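The KeyError above only says that 'labels' is missing, not which keys are actually present. As a debugging aid, the two lookup lines in the collator could be replaced with a small helper that reports the available keys. `pop_label` is a hypothetical name, not something provided by transformers, and this is a diagnostic sketch rather than a fix.

```python
def pop_label(feature, candidates=("label", "labels")):
    """Pop the first label key found, or fail listing the keys that are present."""
    for name in candidates:
        if name in feature:
            return feature.pop(name)
    raise KeyError(
        f"No label key among {candidates}; available keys: {sorted(feature)}"
    )


# A feature dict without any label key, similar to what the collator seems to receive.
feature = {"input_ids": [[1, 2], [3, 4]], "attention_mask": [[1, 1], [1, 1]]}
try:
    pop_label(feature)
except KeyError as err:
    print(err)
```

One unconfirmed possibility: as far as I understand, Trainer by default drops dataset columns that the model's forward method does not accept (the `remove_unused_columns` training argument), which would explain why 'answerKey' never reaches the collator.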
I don't know what causes the error. I suspect it is related to the target key ('answerKey' instead of 'label'), but I could not solve it. Any ideas?
Thanks.