I am working on multiple-choice QA, using the official huggingface/transformers notebook, which is implemented for the SWAG dataset.

I want to use it for other multiple-choice datasets, so I made some dataset-related modifications. All the code is given in the notebook.

The SWAG dataset contains the following columns, including 'label':

train: Dataset({
        features: ['video-id', 'fold-ind', 'startphrase', 'sent1', 'sent2', 'gold-source', 'ending0', 'ending1', 'ending2', 'ending3', 'label'],
        num_rows: 73546
    })

The dataset I want to use has the following columns, with 'answerKey' as the target:

train: Dataset({
        features: ['id', 'question_stem', 'choices', 'answerKey'],
        num_rows: 4957
    })

The error is raised in the data collator, which is:

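from dataclasses import dataclass
from typing import Optional, Union

import torch
from transformers.tokenization_utils_base import PreTrainedTokenizerBase, PaddingStrategy

@dataclass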
class DataCollatorForMultipleChoice:
    """
    Data collator that will dynamically pad the inputs for multiple choice received.
    """

    tokenizer: PreTrainedTokenizerBase
    padding: Union[bool, str, PaddingStrategy] = True
    max_length: Optional[int] = None
    pad_to_multiple_of: Optional[int] = None

    def __call__(self, features):
        print(features[0].keys())
        label_name = "label" if "label" in features[0].keys() else "labels"
        labels = [feature.pop(label_name) for feature in features]
        batch_size = len(features)
        num_choices = len(features[0]["input_ids"])
        flattened_features = [[{k: v[i] for k, v in feature.items()} for i in range(num_choices)] for feature in features]
        flattened_features = sum(flattened_features, [])
        
        batch = self.tokenizer.pad(
            flattened_features,
            padding=self.padding,
            max_length=self.max_length,
            pad_to_multiple_of=self.pad_to_multiple_of,
            return_tensors="pt",
        )
        
        # Un-flatten
        batch = {k: v.view(batch_size, num_choices, -1) for k, v in batch.items()}
        # Add back labels
        batch["labels"] = torch.tensor(labels, dtype=torch.int64)
        return batch

The error is raised at the following lines:

label_name = "label" if "label" in features[0].keys() else "labels"
labels = [feature.pop(label_name) for feature in features]

The error occurs when calling trainer.train():

KeyError                                  Traceback (most recent call last)
<ipython-input-64-3435b262f1ae> in <module>()
----> 1 trainer.train()

5 frames
<ipython-input-60-d1262e974b03> in <listcomp>(.0)
     18         print(features[0].keys())
     19         label_name = "label" if "label" in features[0].keys() else "labels"
---> 20         labels = [feature.pop(label_name) for feature in features]
     21         batch_size = len(features)
     22         num_choices = len(features[0]["input_ids"])

KeyError: 'labels'

I don't know what causes the error. I think it is related to the target key, but I could not solve it. Any ideas?

Thanks,

1 Answer

I got the same error and realised it was due to the lookup: the collator looks for either "label" or "labels" as a feature in your dataset.
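To see why the message names 'labels' rather than 'answerKey', here is a minimal sketch of that lookup using a plain dict in place of a real tokenized example. The helper name find_label_name and the sample values are made up for illustration; only the lookup expression comes from the collator in the question.

```python
def find_label_name(feature):
    """Mirror the collator's lookup: prefer "label", otherwise fall back to "labels"."""
    return "label" if "label" in feature else "labels"


# Hypothetical example whose target is stored under "answerKey" (illustrative values).
feature = {"input_ids": [[101, 102]], "attention_mask": [[1, 1]], "answerKey": "A"}

# Neither "label" nor "labels" is present, so the fallback name "labels" is chosen.
label_name = find_label_name(feature)

try:
    feature.pop(label_name)
    error = None
except KeyError as exc:
    # Same failure as in the traceback: the key "labels" does not exist.
    error = exc

print(label_name, repr(error))
```

Since "label" is missing, the expression falls through to "labels", and the pop fails on that name, regardless of what the target column is actually called.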

Perhaps, if answerKey holds your labels, you could rename that field to "label".
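One thing to watch for: the collator builds the labels with torch.tensor(labels, dtype=torch.int64), so the values need to be integers. Below is a rough sketch of a conversion, assuming answerKey holds a single letter from "A" to "D" whose position matches the index of the correct choice. I have not checked that against your dataset, and the function name add_label and the sample row are hypothetical.

```python
def add_label(example, answer_letters=("A", "B", "C", "D")):
    """Return a copy of the example with an integer "label" derived from "answerKey"."""
    # Assumption: the position of the letter equals the index of the correct choice.
    updated = dict(example)
    # tuple.index raises ValueError if answerKey is not one of the expected letters.
    updated["label"] = answer_letters.index(example["answerKey"])
    return updated


# Hypothetical row with the same column names as in the question (illustrative values).
example = {
    "id": "example-1",
    "question_stem": "Which option is correct?",
    "choices": ["first", "second", "third", "fourth"],
    "answerKey": "C",
}

converted = add_label(example)
print(converted["label"])
```

With the datasets library, something like dataset.map(add_label) should apply this to every row before training, assuming dataset is the object you loaded. If your answerKey values are already integers, renaming the column with rename_column("answerKey", "label") may be enough.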

Laura Uzcategui