I want to add extra layers (an adapter) into the original BERT model, so that only the adapter is trained while the original BERT network stays frozen. Is it possible to initialise the adapter weights with Kaiming initialisation?
(introduced in the paper: Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification)
The adapter consists of two fully connected layers with a non-linear activation function (ReLU) in between. I would appreciate any help. Thank you.
I want to initialise the adapter layers in an effective way. I read this paper recently and was wondering whether its initialisation scheme can be applied in this case. I am also not sure whether it would disrupt the original BERT's pre-trained knowledge.
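To make the question concrete, here is a minimal sketch of what I have in mind, assuming PyTorch and the Hugging Face transformers library; the bottleneck size and learning rate are just placeholder values:

```python
import torch
import torch.nn as nn
from transformers import BertModel


class Adapter(nn.Module):
    """Two fully connected layers with a ReLU in between,
    weights initialised with Kaiming (He) initialisation."""

    def __init__(self, hidden_size=768, bottleneck_size=64):
        super().__init__()
        self.down = nn.Linear(hidden_size, bottleneck_size)
        self.up = nn.Linear(bottleneck_size, hidden_size)
        self.act = nn.ReLU()
        # Kaiming initialisation matched to the ReLU non-linearity
        nn.init.kaiming_normal_(self.down.weight, nonlinearity="relu")
        nn.init.kaiming_normal_(self.up.weight, nonlinearity="relu")
        nn.init.zeros_(self.down.bias)
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return self.up(self.act(self.down(x)))


bert = BertModel.from_pretrained("bert-base-uncased")
for p in bert.parameters():
    p.requires_grad = False  # freeze the original BERT weights

adapter = Adapter(hidden_size=bert.config.hidden_size)
# only the adapter parameters are passed to the optimiser
optimizer = torch.optim.AdamW(adapter.parameters(), lr=1e-4)
```

Is initialising the adapter this way reasonable, given that the frozen BERT weights are left untouched?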