
I am trying to use the HuggingFace library to fine-tune the T5 transformer model on a custom dataset. HF provides an example of fine-tuning with custom data, but it is for the DistilBERT model, not the T5 model I want to use. Their example says I need to implement __len__ and __getitem__ methods in my dataset subclass, but there doesn't seem to be much documentation about what to change when using T5 instead of DistilBERT. Here is the tokenizer code, followed by my attempt at changing __getitem__:

getitem method code
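
Roughly, the tokenizer step was along these lines (a sketch for context, since the original is a screenshot; the model name and variable names are illustrative):

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

train_input_texts = ["summarize: the cat sat on the mat"]  # illustrative data
train_target_texts = ["a cat sat"]                         # illustrative data

# T5 is text-to-text, so both the inputs and the target texts get tokenized
input_encodings = tokenizer(train_input_texts, truncation=True, padding=True)
target_encodings = tokenizer(train_target_texts, truncation=True, padding=True)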

and the resulting error from trainer.train(), which says KeyError: 'labels':

trainer.train() error message

I have seen the following discussion, which seems to relate to this problem, but the answer offered there still produces an error in trainer.train(), which I can also post if useful.

Using the original example code from "fine-tuning with custom data", the dataset class is:

original code from hf distilbert example applied to T5
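
For reference, the dataset class in that HF example has roughly this shape, with labels being a plain list of integer class ids (which is the part that doesn't carry over to T5):

import torch

class ToxicDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # in the DistilBERT example, labels is a list of integers,
        # so each example gets a single class id as its label
        item['labels'] = torch.tensor(self.labels[idx])
        return item

    def __len__(self):
        return len(self.labels)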

but then the error with the trainer changes:

trainer error using hf distilbert example applied to T5

which is what originally got me looking for solutions. So switching from, say, DistilBERT to a text-to-text model like T5 doesn't seem to be as simple as changing the model, the tokenizer, and the input/output data you are training on. DistilBERT doesn't have any output text to train on, so I would have thought (but what do I know?) the setup would be different for T5, but I can't find documentation on how. The bottom of this question seems to point in a direction to follow, but once again I don't know (much!).

I think I may have solved the problem (at least the trainer runs and completes). The DistilBERT model doesn't have output text; its labels are provided to the dataset class as a list of integers. The T5 model has output text, so you assign the output encodings and rely on DataCollatorForSeq2Seq() to prepare the data/features that the T5 model expects. See the changes (for T5) with the commented-out HF code (for DistilBERT) below:

Changes for T5 - commented out distilbert code
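
Since that screenshot isn't reproduced here, this is a sketch of the kind of change described, with the DistilBERT-style lines commented out next to the T5 replacements (model name, class name, and data are illustrative):

import torch
from transformers import (AutoModelForSeq2SeqLM, AutoTokenizer,
                          DataCollatorForSeq2Seq, Trainer, TrainingArguments)

class ToxicDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, target_encodings):
        self.encodings = encodings
        # self.labels = labels                    # DistilBERT: list of ints
        self.target_encodings = target_encodings  # T5: tokenized output text

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        # item['labels'] = torch.tensor(self.labels[idx])          # DistilBERT
        item['labels'] = self.target_encodings['input_ids'][idx]   # T5
        return item

    def __len__(self):
        return len(self.encodings['input_ids'])

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = AutoModelForSeq2SeqLM.from_pretrained("t5-small")

input_texts = ["summarize: some long input text"]  # illustrative data
target_texts = ["short summary"]                   # illustrative data

# no padding here: DataCollatorForSeq2Seq pads each batch on the fly
input_encodings = tokenizer(input_texts, truncation=True)
target_encodings = tokenizer(target_texts, truncation=True)

train_dataset = ToxicDataset(input_encodings, target_encodings)

# DataCollatorForSeq2Seq pads inputs and labels per batch and, given the
# model, builds the decoder_input_ids that T5 expects
data_collator = DataCollatorForSeq2Seq(tokenizer, model=model)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out"),
    train_dataset=train_dataset,
    data_collator=data_collator,
)
trainer.train()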

I raised an issue with HuggingFace, and they advised that the fine-tuning with custom datasets example on their website was out of date and that I needed to work from their maintained examples.

1 Answer


Based on your screenshots, here's how I'd implement __len__ and __getitem__:

import torch

class ToxicDataset(torch.utils.data.Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings  # tokenized input texts
        self.labels = labels        # tokenized target texts (a BatchEncoding)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        # 'labels' is the key the model looks for when computing the loss
        item['labels'] = torch.tensor(self.labels['input_ids'][idx])
        return item

    def __len__(self):
        return len(self.labels['input_ids'])
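
Here labels is the tokenizer output for the target texts (not a list of ints), so construction would look something like this (assuming a T5 tokenizer and illustrative text lists):

from transformers import T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-small")

input_texts = ["summarize: example input text"]  # illustrative data
target_texts = ["example summary"]               # illustrative data

input_encodings = tokenizer(input_texts, truncation=True, padding=True)
label_encodings = tokenizer(target_texts, truncation=True, padding=True)

# label_encodings['input_ids'] is what __getitem__ indexes above
train_dataset = ToxicDataset(input_encodings, label_encodings)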