
Alright. So there are multiple methods to fine-tune a transformer:

  1. freeze the transformer's parameters and feed only its final outputs into another model (the user trains this "other" model),
  2. fine-tune the whole transformer together with a user-added custom layer.

Multiple papers in top conferences use the second method. The same goes for those "how to fine-tune BERT" blog posts, which usually define a custom PyTorch layer as an nn.Module object. A common implementation might look like this:

# Example 1 Start

from transformers import RobertaModel
import torch
import torch.nn as nn

class ClassificationHead(nn.Module):
    # args (hidden_dim, drop_out, num_labels) is assumed to be defined
    # elsewhere, e.g. an argparse.Namespace
    def __init__(self):
        super().__init__()
        self.dense = nn.Linear(args.hidden_dim, args.hidden_dim)
        classifier_dropout = (args.drop_out if args.drop_out is not None else 0.1)
        self.dropout = nn.Dropout(classifier_dropout)
        self.out_proj = nn.Linear(args.hidden_dim, args.num_labels)

    def forward(self, features):
        x = features[:, 0, :]  # take <s> token (equiv. to [CLS])
        x = self.dropout(x)
        x = self.dense(x)
        x = torch.tanh(x)
        x = self.dropout(x)
        x = self.out_proj(x)
        return x

class SequenceClassification(nn.Module):
    def __init__(self):
        super().__init__()
        self.num_labels = args.num_labels
        self.model = RobertaModel.from_pretrained('roberta-base')
        self.classifier = ClassificationHead()

    def forward(self, input_ids, attention_mask):
        outputs = self.model(input_ids=input_ids, attention_mask=attention_mask)
        sequence_output = outputs[0] #last hidden state
        logits = self.classifier(sequence_output)
        return logits

# Example 1 End

# Example 2 Start

from transformers import BertModel

class BertBinaryClassifier(nn.Module):
    def __init__(self, dropout=0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(768, 1)
        self.sigmoid = nn.Sigmoid()

    def forward(self, tokens):
        outputs = self.bert(tokens)
        pooled_output = outputs[1]  # pooled [CLS] representation
        dropout_output = self.dropout(pooled_output)
        linear_output = self.linear(dropout_output)
        proba = self.sigmoid(linear_output)
        return proba

# Example 2 End

So, I was wondering whether model classes like the ones above can be used to train the transformer model itself. From the code, it looks as if the transformer is only used to produce features (feature extraction) and the user is training only the new custom layers (i.e., method 1 above).

But many blog posts say the code above corresponds to method 2, where the whole transformer is fine-tuned along with the new custom layer. How is this so?
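If it helps clarify what I mean, here is a small check I imagine one could run (a minimal sketch, assuming the SequenceClassification class from Example 1; the counting is just for illustration):

model = SequenceClassification()

# Count parameters that will receive gradients. requires_grad defaults to
# True for the pretrained weights, so the RoBERTa backbone shows up here
# alongside the classification head.
trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(f"trainable parameters: {trainable}")

# Passing model.parameters() to the optimizer would then update the whole
# transformer together with the custom head.
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)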

If the code above is indeed method 2, how does one freeze the BERT model and train only the new custom layer? Does anyone know a good code implementation to refer to?
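My best guess at what freezing would look like is something like this (again a minimal sketch, assuming the SequenceClassification class from Example 1, where self.model holds the RoBERTa backbone), but I'd appreciate a pointer to a proper implementation:

model = SequenceClassification()

# Freeze the transformer backbone: parameters with requires_grad = False
# receive no gradients, so only the classification head gets updated.
for param in model.model.parameters():
    param.requires_grad = False

# Only pass the still-trainable parameters to the optimizer.
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=2e-5
)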

brucewlee
