
For example, if it is a multi-class classification task, is the following line necessary in the forward function?

final_layer = self.relu(linear_output)

The class definition is below:

from torch import nn
from transformers import BertModel


class BertClassifier(nn.Module):

    def __init__(self, dropout=0.5):
        super(BertClassifier, self).__init__()

        self.bert = BertModel.from_pretrained('bert-base-cased')
        self.dropout = nn.Dropout(dropout)
        self.linear = nn.Linear(768, 5)   # 768 = BERT hidden size, 5 output classes
        self.relu = nn.ReLU()

    def forward(self, input_id, mask):
        # pooled_output is BERT's pooled [CLS] representation
        _, pooled_output = self.bert(input_ids=input_id, attention_mask=mask, return_dict=False)
        dropout_output = self.dropout(pooled_output)
        linear_output = self.linear(dropout_output)
        # final_layer = self.relu(linear_output)
        return linear_output
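
For context, here is a minimal sketch of how the classifier might be trained, assuming the matching 'bert-base-cased' tokenizer and a made-up two-sentence batch with made-up labels; note that nn.CrossEntropyLoss applies log-softmax internally, so it expects the raw logits returned above:

import torch
from torch import nn
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
model = BertClassifier()
criterion = nn.CrossEntropyLoss()   # expects raw, unactivated logits

# Hypothetical mini-batch of two sentences with labels in [0, 5)
batch = tokenizer(["an example sentence", "another example"],
                  padding=True, truncation=True, return_tensors="pt")
labels = torch.tensor([0, 3])

logits = model(batch["input_ids"], batch["attention_mask"])   # shape: (2, 5)
loss = criterion(logits, labels)
loss.backward()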
  • You can use relu just fine; it's not strictly necessary. Even the original paper stated that using Gelu gives better results, but it never really gave any reason as to why it performs better – Edwin Cheong Mar 19 '23 at 07:14
  • @EdwinCheong Gelu was used in pre-training, and the paper didn't say whether an activation (Relu or Gelu) should be used or not. Did you mean that in the fine-tuning code above, the 'relu' line can be commented out? – marlon Mar 19 '23 at 15:24
  • You can just comment it out, but to be exact you should add it before the linear layer – Edwin Cheong Mar 19 '23 at 15:31
  • "but you should add it before the linear layer to be exact": What do you mean? Where should the relu be added? – marlon Mar 19 '23 at 16:38
  • Questions like yours are usually difficult to answer because they can depend on your downstream task (i.e. try it and you will know the answer). The `relu` function will give you vector representations with plenty of zeros after your fine-tuning which might be less insightful for your downstream task. – cronoik Mar 22 '23 at 20:42
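
To illustrate the point in the last comment, a tiny sketch with made-up logits: applying relu to the classifier output zeroes every negative score, discarding the relative ordering among those classes.

import torch
from torch import nn

logits = torch.tensor([[2.3, -1.7, 0.4, -0.2, -3.1]])   # made-up output of the linear layer
print(nn.ReLU()(logits))   # tensor([[2.3000, 0.0000, 0.4000, 0.0000, 0.0000]])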

0 Answers