I load the RoBERTa model with TFRobertaModel.from_pretrained('roberta-base') and train it using Keras. I have other layers on top of RoBERTa, and I need the bare RoBERTa model initialized with all of its pretrained parameters. I run my code on Colab, and until a few weeks ago, loading RoBERTa produced the following warning, but everything was still OK and the model trained properly even though the 'lm_head' weights were reported as unused:
Some weights of the model checkpoint at roberta-base were not used when initializing ROBERTA: ['lm_head']
But now I think the version of transformers on Colab has changed, because with the same code I get a new warning listing many more weights (the encoder, embedding, and pooler layers) as unused, and the training loss no longer decreases:
Some layers from the model checkpoint at roberta-base were not used when initializing ROBERTA: ['lm_head', 'encoder/layer_._3/attention/self/value/bias:0', 'encoder/layer_._10/attention/self/value/bias:0', 'encoder/layer_._10/attention/self/key/kernel:0', 'pooler/dense/bias:0', 'encoder/layer_._9/attention/self/query/kernel:0', 'encoder/layer_._10/attention/self/query/kernel:0', 'encoder/layer_._7/attention/output/dense/bias:0', 'embeddings/position_embeddings/embeddings:0', 'encoder/layer_._6/intermediate/dense/kernel:0', 'encoder/layer_._11/intermediate/dense/kernel:0', 'encoder/layer_._8/intermediate/dense/bias:0', 'encoder/layer_._10/attention/self/value/kernel:0', 'encoder/layer_._7/output/dense/bias:0', 'encoder/layer_._6/attention/self/value/bias:0', 'encoder/layer_._8/attention/output/dense/kernel:0', 'encoder/layer_._10/intermediate/dense/kernel:0', 'encoder/layer_._5/attention/self/value/kernel:0', 'encoder/layer_._6/attention/output/LayerNorm/gamma:0', 'encoder/layer_._7/attention/self/query/kernel:0', 'encoder/layer_._6/attention/self/query/kernel:0', 'encoder/layer_._6/attention/self/key/bias:0', 'encoder/layer_._8/attention/output/LayerNorm/gamma:0', 'encoder/layer_._2/output/dense/kernel:0', 'encoder/layer_._11/intermediate/dense/bias:0', 'encoder/layer_._6/output/dense/kernel:0', 'encoder/layer_._2/intermediate/dense/kernel:0', 'encoder/layer_._3/intermediate/dense/kernel:0', 'encoder/layer_._10/output/LayerNorm/beta:0', 'encoder/layer_._6/attention/self/query/bias:0', 'encoder/layer_._6/attention/output/LayerNorm/beta:0', 'encoder/layer_._9/attention/self/value/bias:0', 'encoder/layer_._8/attention/self/query/kernel:0', 'encoder/layer_._0/output/LayerNorm/gamma:0', 'encoder/layer_._11/attention/output/dense/bias:0', 'encoder/layer_._7/attention/self/value/bias:0', 'encoder/layer_._0/attention/output/dense/kernel:0', 'encoder/layer_._9/intermediate/dense/bias:0', 'encoder/layer_._2/attention/self/query/kernel:0', 'encoder/layer_._0/attention/self/key/bias:0', 'encoder/layer_._8/attention/output/LayerNorm/beta:0', 'encoder/layer_._1/attention/self/value/kernel:0', 'encoder/layer_._6/output/LayerNorm/gamma:0', 'encoder/layer_._1/attention/output/dense/bias:0', 'encoder/layer_._3/attention/self/query/bias:0', 'encoder/layer_._3/output/dense/bias:0', 'encoder/layer_._1/attention/self/key/kernel:0', 'encoder/layer_._8/attention/self/key/kernel:0', 'encoder/layer_._9/intermediate/dense/kernel:0', 'encoder/layer_._3/output/dense/kernel:0', 'encoder/layer_._2/output/LayerNorm/beta:0', 'encoder/layer_._7/attention/self/key/bias:0', 'encoder/layer_._5/attention/self/key/kernel:0', 'encoder/layer_._5/attention/self/query/bias:0', 'encoder/layer_._2/attention/output/dense/bias:0', 'encoder/layer_._4/intermediate/dense/kernel:0', 'encoder/layer_._1/intermediate/dense/bias:0', 'encoder/layer_._4/attention/self/value/kernel:0', 'encoder/layer_._11/attention/self/key/bias:0', 'encoder/layer_._5/output/dense/kernel:0', 'encoder/layer_._1/output/dense/bias:0', 'encoder/layer_._0/attention/self/value/bias:0', 'encoder/layer_._6/attention/self/key/kernel:0', 'encoder/layer_._9/attention/self/key/bias:0', 'encoder/layer_._7/output/LayerNorm/gamma:0', 'encoder/layer_._8/attention/output/dense/bias:0', 'encoder/layer_._10/attention/output/dense/bias:0', 'encoder/layer_._0/intermediate/dense/kernel:0', 'encoder/layer_._5/intermediate/dense/kernel:0', 'encoder/layer_._11/attention/self/value/kernel:0', 'encoder/layer_._8/attention/self/key/bias:0', 'encoder/layer_._8/output/dense/bias:0', 
'encoder/layer_._8/intermediate/dense/kernel:0', 'encoder/layer_._7/attention/output/LayerNorm/beta:0', 'encoder/layer_._2/output/dense/bias:0', 'encoder/layer_._3/attention/output/dense/bias:0', 'encoder/layer_._0/output/dense/bias:0', 'encoder/layer_._9/attention/self/key/kernel:0', 'encoder/layer_._11/output/dense/bias:0', 'encoder/layer_._7/attention/self/query/bias:0', 'encoder/layer_._10/attention/self/key/bias:0', 'encoder/layer_._2/attention/output/dense/kernel:0', 'encoder/layer_._2/attention/self/query/bias:0', 'encoder/layer_._9/attention/output/dense/kernel:0', 'encoder/layer_._9/attention/output/LayerNorm/gamma:0', 'encoder/layer_._9/output/LayerNorm/gamma:0', 'encoder/layer_._0/attention/output/LayerNorm/beta:0', 'encoder/layer_._1/intermediate/dense/kernel:0', 'encoder/layer_._1/output/dense/kernel:0', 'encoder/layer_._1/attention/self/key/bias:0', 'encoder/layer_._2/attention/self/value/kernel:0', 'encoder/layer_._9/attention/self/value/kernel:0', 'encoder/layer_._10/intermediate/dense/bias:0', 'encoder/layer_._4/intermediate/dense/bias:0', 'encoder/layer_._6/output/LayerNorm/beta:0', 'encoder/layer_._7/output/LayerNorm/beta:0', 'encoder/layer_._11/attention/self/query/bias:0', 'encoder/layer_._0/intermediate/dense/bias:0', 'encoder/layer_._11/attention/output/dense/kernel:0', 'encoder/layer_._5/attention/self/query/kernel:0', 'encoder/layer_._8/attention/self/value/kernel:0', 'encoder/layer_._11/output/LayerNorm/beta:0', 'encoder/layer_._9/output/dense/bias:0', 'encoder/layer_._4/output/dense/bias:0', 'encoder/layer_._2/attention/self/key/bias:0', 'encoder/layer_._3/attention/self/query/kernel:0', 'encoder/layer_._4/attention/output/LayerNorm/gamma:0', 'encoder/layer_._1/attention/output/LayerNorm/beta:0', 'encoder/layer_._1/output/LayerNorm/beta:0', 'encoder/layer_._10/attention/output/LayerNorm/beta:0', 'encoder/layer_._3/attention/self/value/kernel:0', 'encoder/layer_._10/attention/self/query/bias:0', 'encoder/layer_._3/attention/self/key/bias:0', 'pooler/dense/kernel:0', 'encoder/layer_._1/attention/self/value/bias:0', 'encoder/layer_._7/attention/self/key/kernel:0', 'encoder/layer_._1/attention/output/dense/kernel:0', 'encoder/layer_._4/attention/self/key/kernel:0', 'encoder/layer_._8/output/dense/kernel:0', 'encoder/layer_._3/attention/output/LayerNorm/gamma:0', 'encoder/layer_._0/attention/self/value/kernel:0', 'encoder/layer_._3/attention/self/key/kernel:0', 'encoder/layer_._0/attention/self/query/kernel:0', 'encoder/layer_._3/intermediate/dense/bias:0', 'encoder/layer_._7/output/dense/kernel:0', 'encoder/layer_._10/output/dense/kernel:0', 'encoder/layer_._7/intermediate/dense/bias:0', 'embeddings/word_embeddings/weight:0', 'encoder/layer_._3/attention/output/LayerNorm/beta:0', 'encoder/layer_._0/attention/self/key/kernel:0', 'encoder/layer_._4/output/dense/kernel:0', 'encoder/layer_._5/output/LayerNorm/gamma:0', 'encoder/layer_._9/attention/output/dense/bias:0', 'encoder/layer_._0/attention/output/dense/bias:0', 'encoder/layer_._5/attention/output/LayerNorm/gamma:0', 'encoder/layer_._9/attention/output/LayerNorm/beta:0', 'encoder/layer_._11/output/LayerNorm/gamma:0', 'encoder/layer_._11/attention/output/LayerNorm/gamma:0', 'encoder/layer_._6/intermediate/dense/bias:0', 'encoder/layer_._2/attention/output/LayerNorm/gamma:0', 'encoder/layer_._5/output/dense/bias:0', 'encoder/layer_._0/output/dense/kernel:0', 'encoder/layer_._6/attention/output/dense/kernel:0', 'encoder/layer_._6/attention/output/dense/bias:0', 'encoder/layer_._1/attention/self/query/kernel:0', 
'encoder/layer_._0/attention/self/query/bias:0', 'encoder/layer_._11/attention/self/value/bias:0', 'encoder/layer_._2/intermediate/dense/bias:0', 'embeddings/LayerNorm/beta:0', 'encoder/layer_._4/attention/output/dense/kernel:0', 'encoder/layer_._3/output/LayerNorm/beta:0', 'encoder/layer_._8/output/LayerNorm/gamma:0', 'encoder/layer_._10/attention/output/dense/kernel:0', 'encoder/layer_._11/output/dense/kernel:0', 'encoder/layer_._2/attention/output/LayerNorm/beta:0', 'encoder/layer_._7/attention/output/dense/kernel:0', 'encoder/layer_._9/attention/self/query/bias:0', 'encoder/layer_._4/attention/self/key/bias:0', 'encoder/layer_._2/output/LayerNorm/gamma:0', 'encoder/layer_._0/attention/output/LayerNorm/gamma:0', 'encoder/layer_._1/attention/output/LayerNorm/gamma:0', 'encoder/layer_._1/attention/self/query/bias:0', 'encoder/layer_._5/attention/output/LayerNorm/beta:0', 'encoder/layer_._10/output/dense/bias:0', 'encoder/layer_._8/output/LayerNorm/beta:0', 'encoder/layer_._5/output/LayerNorm/beta:0', 'embeddings/token_type_embeddings/embeddings:0', 'encoder/layer_._5/attention/output/dense/bias:0', 'encoder/layer_._4/output/LayerNorm/beta:0', 'encoder/layer_._4/attention/self/query/kernel:0', 'encoder/layer_._5/attention/output/dense/kernel:0', 'encoder/layer_._7/attention/self/value/kernel:0', 'encoder/layer_._7/intermediate/dense/kernel:0', 'encoder/layer_._11/attention/self/key/kernel:0', 'encoder/layer_._3/output/LayerNorm/gamma:0', 'encoder/layer_._10/output/LayerNorm/gamma:0', 'encoder/layer_._8/attention/self/query/bias:0', 'encoder/layer_._3/attention/output/dense/kernel:0', 'encoder/layer_._4/output/LayerNorm/gamma:0', 'encoder/layer_._10/attention/output/LayerNorm/gamma:0', 'encoder/layer_._4/attention/self/value/bias:0', 'encoder/layer_._11/attention/self/query/kernel:0', 'encoder/layer_._4/attention/output/dense/bias:0', 'encoder/layer_._4/attention/output/LayerNorm/beta:0', 'encoder/layer_._5/attention/self/key/bias:0', 'encoder/layer_._6/attention/self/value/kernel:0', 'encoder/layer_._5/attention/self/value/bias:0', 'encoder/layer_._11/attention/output/LayerNorm/beta:0', 'encoder/layer_._1/output/LayerNorm/gamma:0', 'encoder/layer_._2/attention/self/value/bias:0', 'encoder/layer_._9/output/dense/kernel:0', 'encoder/layer_._2/attention/self/key/kernel:0', 'encoder/layer_._9/output/LayerNorm/beta:0', 'encoder/layer_._7/attention/output/LayerNorm/gamma:0', 'encoder/layer_._5/intermediate/dense/bias:0', 'embeddings/LayerNorm/gamma:0', 'encoder/layer_._0/output/LayerNorm/beta:0', 'encoder/layer_._6/output/dense/bias:0', 'encoder/layer_._8/attention/self/value/bias:0', 'encoder/layer_._4/attention/self/query/bias:0']
Can anyone help me with my question: how can I load RoBERTa and initialize all of its weights properly?
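For reference, here is a minimal sketch of my setup. The sequence length (128) and the final Dense classification head are just placeholders; my real task-specific layers on top of RoBERTa are different:

```python
import tensorflow as tf
from transformers import TFRobertaModel

# Load the bare pretrained RoBERTa encoder (no LM head).
# The 'lm_head' warning alone is expected: TFRobertaModel has no LM head,
# so those checkpoint weights are deliberately skipped.
roberta = TFRobertaModel.from_pretrained('roberta-base')

# Keras inputs matching the tokenizer's output (sequence length is illustrative).
input_ids = tf.keras.Input(shape=(128,), dtype=tf.int32, name='input_ids')
attention_mask = tf.keras.Input(shape=(128,), dtype=tf.int32, name='attention_mask')

# Run RoBERTa and take the hidden state of the first (<s>) token.
sequence_output = roberta(input_ids, attention_mask=attention_mask)[0]
cls_token = sequence_output[:, 0, :]

# Placeholder task head on top of the encoder.
output = tf.keras.layers.Dense(2, activation='softmax')(cls_token)

model = tf.keras.Model(inputs=[input_ids, attention_mask], outputs=output)
model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
```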