Finetune TFBertForMaskedLM model.fit() ValueError

Question

The Problem

I have been trying to fine-tune a TFBertForMaskedLM model with TensorFlow, but model.fit() always raises an error. I hope someone can help and suggest a solution.

Reference Paper and sample output

The paper is titled "Conditional Bert for Contextual Augmentation". In short, it replaces token_type_ids with label ids. For example, if a sentence has label 5 and length 10, and max_sequence_length = 16, the processed output is as follows:

input_ids = [101, 523, 791, 3189, 677, 5221, 524, 1920, 686, 102, 0, 0, 0, 0, 0, 0]
attention_mask = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
token_type_ids = [5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 0, 0, 0, 0, 0, 0]
labels = [-100, -100, 791, -100, 677, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100]
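The layout above can be reproduced with a small helper. This is only a sketch: the function name build_example and its parameters are hypothetical, the token ids are assumed to already include [CLS] and [SEP], and the step that replaces selected tokens with [MASK] is left out so that the result matches the sample shown.

```python
def build_example(token_ids, label_id, max_len, target_positions,
                  pad_id=0, ignore_index=-100):
    """Pad one sentence and use its class label as token_type_ids."""
    if len(token_ids) > max_len:
        raise ValueError("sentence is longer than max_len")
    n_tokens = len(token_ids)
    n_pad = max_len - n_tokens

    # Only the chosen positions carry a target id; every other
    # position is set to ignore_index so the loss can skip it.
    labels = [ignore_index] * max_len
    for pos in target_positions:
        labels[pos] = token_ids[pos]

    return {
        "input_ids": list(token_ids) + [pad_id] * n_pad,
        "attention_mask": [1] * n_tokens + [0] * n_pad,
        # Real tokens get the sentence label, padding gets 0.
        "token_type_ids": [label_id] * n_tokens + [0] * n_pad,
        "labels": labels,
    }


example = build_example(
    token_ids=[101, 523, 791, 3189, 677, 5221, 524, 1920, 686, 102],
    label_id=5,
    max_len=16,
    target_positions=[2, 4],
)
```

Note that the label ids (1~5) and the padding id (0) share the same id space in token_type_ids.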

Environment

tensorflow == 2.2.0
transformers (Hugging Face) == 3.5.0
datasets == 1.1.2
The dataset has 5 labels in total (1~5).
GPU : GCP P100 * 1

Dataset output (max_sequence_length=128, batch_size=1)

{'attention_mask': <tf.Tensor: shape=(128,), dtype=int32, numpy=
 array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)>,
 'input_ids': <tf.Tensor: shape=(128,), dtype=int32, numpy=
 array([  101,   523,   791,  3189,   677,  5221,   524,  1920,   686,
         4518,  6240,   103,  2466,  2204,  2695,   100,   519,  5064,
         1918,   736,  2336,   520,   103,  2695,  1564,  4923,  8013,
          678,  6734,  8038,  8532,   131,   120,   120,  8373,   119,
          103,  9989,   103,  8450,   120,   103,   120, 12990,  8921,
         8165,   102,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0,     0,     0,     0,     0,     0,     0,     0,
            0,     0], dtype=int32)>,
 'labels': <tf.Tensor: shape=(128,), dtype=int32, numpy=
 array([-100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        4634, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        4158, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, 8429, -100,  119, -100, -100,  100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100,
        -100, -100, -100, -100, -100, -100, -100], dtype=int32)>,
 'token_type_ids': <tf.Tensor: shape=(128,), dtype=int32, numpy=
 array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], dtype=int32)>}

Model code

import tensorflow as tf
from transformers import AdamWeightDecay, TFBertForMaskedLM, BertConfig

def create_model():
    configuration = BertConfig.from_pretrained('bert-base-chinese')
    model = TFBertForMaskedLM.from_pretrained('bert-base-chinese',
                                              config=configuration)
    # Replace the segment embedding with a label embedding (5 rows, hidden size 768).
    # Note: label ids 1~5 plus padding id 0 span 0..5, so 5 rows may be one too few.
    model.bert.embeddings.token_type_embeddings = tf.keras.layers.Embedding(
        5, 768,
        embeddings_initializer=tf.keras.initializers.TruncatedNormal(stddev=0.02))
    return model

model = create_model()

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = [tf.keras.metrics.Mean(), tf.keras.metrics.SparseCategoricalAccuracy('accuracy')]

model.compile(optimizer = optimizer,
              loss = loss,
              metrics = metrics)

model.fit(tf_sms_dataset, 
          epochs=1,
          verbose=1)
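A side note on the loss used above: the labels contain -100 at positions that should be ignored, and as far as I know tf.keras.losses.SparseCategoricalCrossentropy in TensorFlow 2.2 has no option to skip an ignore index. A minimal sketch of a masked variant follows. The name masked_sparse_ce and the toy tensors are illustrative assumptions, not part of transformers or of the original code.

```python
import tensorflow as tf


def masked_sparse_ce(labels, logits, ignore_index=-100):
    """Sparse cross-entropy averaged over positions whose label != ignore_index."""
    loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(
        from_logits=True, reduction=tf.keras.losses.Reduction.NONE)
    # Flatten (batch, seq_len) labels and (batch, seq_len, vocab) logits.
    flat_labels = tf.reshape(labels, (-1,))
    flat_logits = tf.reshape(logits, (-1, tf.shape(logits)[-1]))
    # Keep only the positions that carry a real target id.
    active = tf.not_equal(flat_labels, ignore_index)
    active_labels = tf.boolean_mask(flat_labels, active)
    active_logits = tf.boolean_mask(flat_logits, active)
    return tf.reduce_mean(loss_fn(active_labels, active_logits))


# Toy check: uniform logits over a vocabulary of 4, one active position.
toy_logits = tf.zeros((1, 3, 4))
toy_labels = tf.constant([[-100, 2, -100]])
toy_loss = masked_sparse_ce(toy_labels, toy_logits)
```

If a batch has no active position at all, the mean is taken over an empty tensor and becomes NaN, so a real implementation would need to guard against that case.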

Warning message when loading TFBertForMaskedLM

Some layers from the model checkpoint at bert-base-chinese were not used when initializing TFBertForMaskedLM: ['nsp___cls']
- This IS expected if you are initializing TFBertForMaskedLM from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing TFBertForMaskedLM from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the layers of TFBertForMaskedLM were initialized from the model checkpoint at bert-base-chinese.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFBertForMaskedLM for predictions without further training.

Error Message

ValueError                                Traceback (most recent call last)
<ipython-input-42-99b78906fef7> in <module>()
      5 model.fit(tf_sms_dataset, 
      6           epochs=1,
----> 7           verbose=1)

10 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py in _method_wrapper(self, *args, **kwargs)
     64   def _method_wrapper(self, *args, **kwargs):
     65     if not self._in_multi_worker_mode():  # pylint: disable=protected-access
---> 66       return method(self, *args, **kwargs)
     67 
     68     # Running inside `run_distribute_coordinator` already.

/usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_batch_size, validation_freq, max_queue_size, workers, use_multiprocessing)
    846                 batch_size=batch_size):
    847               callbacks.on_train_batch_begin(step)
--> 848               tmp_logs = train_function(iterator)
    849               # Catch OutOfRangeError for Datasets of unknown size.
    850               # This blocks until the batch has finished executing.

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py in __call__(self, *args, **kwds)
    578         xla_context.Exit()
    579     else:
--> 580       result = self._call(*args, **kwds)
    581 
    582     if tracing_count == self._get_tracing_count():

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py in _call(self, *args, **kwds)
    625       # This is the first call of __call__, so we have to initialize.
    626       initializers = []
--> 627       self._initialize(args, kwds, add_initializers_to=initializers)
    628     finally:
    629       # At this point we know that the initialization is complete (or less

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py in _initialize(self, args, kwds, add_initializers_to)
    504     self._concrete_stateful_fn = (
    505         self._stateful_fn._get_concrete_function_internal_garbage_collected(  # pylint: disable=protected-access
--> 506             *args, **kwds))
    507 
    508     def invalid_creator_scope(*unused_args, **unused_kwds):

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py in _get_concrete_function_internal_garbage_collected(self, *args, **kwargs)
   2444       args, kwargs = None, None
   2445     with self._lock:
-> 2446       graph_function, _, _ = self._maybe_define_function(args, kwargs)
   2447     return graph_function
   2448 

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py in _maybe_define_function(self, args, kwargs)
   2775 
   2776       self._function_cache.missed.add(call_context_key)
-> 2777       graph_function = self._create_graph_function(args, kwargs)
   2778       self._function_cache.primary[cache_key] = graph_function
   2779       return graph_function, args, kwargs

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py in _create_graph_function(self, args, kwargs, override_flat_arg_shapes)
   2665             arg_names=arg_names,
   2666             override_flat_arg_shapes=override_flat_arg_shapes,
-> 2667             capture_by_value=self._capture_by_value),
   2668         self._function_attributes,
   2669         # Tell the ConcreteFunction to clean up its graph once it goes out of

/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py in func_graph_from_py_func(name, python_func, args, kwargs, signature, func_graph, autograph, autograph_options, add_control_dependencies, arg_names, op_return_value, collections, capture_by_value, override_flat_arg_shapes)
    979         _, original_func = tf_decorator.unwrap(python_func)
    980 
--> 981       func_outputs = python_func(*func_args, **func_kwargs)
    982 
    983       # invariant: `func_outputs` contains only Tensors, CompositeTensors,

/usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py in wrapped_fn(*args, **kwds)
    439         # __wrapped__ allows AutoGraph to swap in a converted function. We give
    440         # the function a weak reference to itself to avoid a reference cycle.
--> 441         return weak_wrapped_fn().__wrapped__(*args, **kwds)
    442     weak_wrapped_fn = weakref.ref(wrapped_fn)
    443 
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py in wrapper(*args, **kwargs)
    966           except Exception as e:  # pylint:disable=broad-except
    967             if hasattr(e, "ag_error_metadata"):
--> 968               raise e.ag_error_metadata.to_exception(e)
    969             else:
    970               raise

ValueError: in user code:

    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:571 train_function  *
        outputs = self.distribute_strategy.run(
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:951 run  **
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2290 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2649 _call_for_each_replica
        return fn(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:541 train_step  **
        self.trainable_variables)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:1804 _minimize
        trainable_variables))
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:521 _aggregate_gradients
        filtered_grads_and_vars = _filter_grads(grads_and_vars)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:1219 _filter_grads
        ([v.name for _, v in grads_and_vars],))

    ValueError: No gradients provided for any variable: ['tf_bert_for_masked_lm_2/bert/embeddings/word_embeddings/weight:0', 'tf_bert_for_masked_lm_2/bert/embeddings/position_embeddings/embeddings:0', 'tf_bert_for_masked_lm_2/bert/embeddings/LayerNorm/gamma:0', 'tf_bert_for_masked_lm_2/bert/embeddings/LayerNorm/beta:0', 'tf_bert_for_masked_lm_2/bert/embeddings/embedding_1/embeddings:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/attention/self/query/kernel:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/attention/self/query/bias:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/attention/self/key/kernel:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/attention/self/key/bias:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/attention/self/value/kernel:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/attention/self/value/bias:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/attention/output/dense/kernel:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/attention/output/dense/bias:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/attention/output/LayerNorm/gamma:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/attention/output/LayerNorm/beta:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/intermediate/dense/kernel:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/intermediate/dense/bias:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/output/dense/kernel:0', 'tf_bert_for_masked_lm_2/bert/encoder/layer_._0/output/dense/bias:0', 'tf_bert_f...

Can someone help? Thanks a lot.

Other Test

I also tested with an English sentence, as follows:

import tensorflow as tf
from transformers import TFBertForMaskedLM, BertConfig, BertTokenizer

def create_model():
    configuration = BertConfig.from_pretrained('bert-base-uncased')
    model = TFBertForMaskedLM.from_pretrained('bert-base-uncased',
                                              config=configuration)
    return model
    
model = create_model()
eng_tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
token_info = eng_tokenizer(text="We are very happy to show you the  Transformers library.", padding='max_length', max_length=20)

optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metrics = [tf.keras.metrics.Mean(), tf.keras.metrics.SparseCategoricalAccuracy("acc")]

dataset = tf.data.Dataset.from_tensor_slices(dict(token_info))
dataset = dataset.batch(1).prefetch(tf.data.experimental.AUTOTUNE)

model.compile(optimizer = optimizer,
              loss = model.compute_loss,
              metrics = metrics)

model.fit(dataset)

Contents of token_info used to build the dataset

{
  'input_ids': [101, 2057, 2024, 2200, 103, 2000, 2265, 2017, 103, 100, 19081, 3075, 1012, 102, 0, 0, 0, 0, 0, 0],
  'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
  'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
  'labels': [-100, -100, -100, -100, 3407, -100, -100, -100, 1996, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100, -100]
}

I get the same error:

ValueError: in user code:

    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:571 train_function  *
        outputs = self.distribute_strategy.run(
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:951 run  **
        return self._extended.call_for_each_replica(fn, args=args, kwargs=kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2290 call_for_each_replica
        return self._call_for_each_replica(fn, args, kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/distribute/distribute_lib.py:2649 _call_for_each_replica
        return fn(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:541 train_step  **
        self.trainable_variables)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/engine/training.py:1804 _minimize
        trainable_variables))
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:521 _aggregate_gradients
        filtered_grads_and_vars = _filter_grads(grads_and_vars)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:1219 _filter_grads
        ([v.name for _, v in grads_and_vars],))

    ValueError: No gradients provided for any variable: ['tf_bert_for_masked_lm_2/bert/embeddings/word_embeddings/weight:0', 'tf_bert_for_masked_lm_2/bert/embeddings/position_embeddings/embeddings:0', 'tf_bert_for_masked_lm_2/bert/embeddings/token_type_embeddings/embeddings:0', 'tf_bert_for_masked_lm_2/bert/embeddings/LayerNorm/gamma:0', 'tf_bert_for_masked_lm_2/bert/embeddings/LayerNorm/beta:0',

Is there a problem with how fit() integrates with this model?
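One possible cause, stated as an assumption rather than a confirmed diagnosis: both datasets yield a single dict that also contains 'labels', so Keras treats the whole dict as the input and gets no separate target. The compiled loss is then never computed against the labels, which would explain "No gradients provided for any variable". A minimal sketch of splitting each element into an (inputs, targets) pair is below. The helper name split_features_and_labels is hypothetical, and the encoded values are copied from the English test with a leading batch dimension added.

```python
import tensorflow as tf


def split_features_and_labels(example):
    """Turn one dict element into the (inputs, targets) pair that fit() expects."""
    features = dict(example)          # shallow copy, leave the original untouched
    labels = features.pop("labels")   # targets must not stay inside the inputs
    return features, labels


encoded = {
    "input_ids": [[101, 2057, 2024, 2200, 103, 2000, 2265, 2017, 103, 100,
                   19081, 3075, 1012, 102, 0, 0, 0, 0, 0, 0]],
    "attention_mask": [[1] * 14 + [0] * 6],
    "token_type_ids": [[0] * 20],
    "labels": [[-100, -100, -100, -100, 3407, -100, -100, -100, 1996, -100,
                -100, -100, -100, -100, -100, -100, -100, -100, -100, -100]],
}

dataset = (tf.data.Dataset.from_tensor_slices(encoded)
           .map(split_features_and_labels)
           .batch(1))

features, labels = next(iter(dataset))
```

With elements shaped like this, model.fit(dataset) receives the labels as y. The loss passed to compile() still has to cope with the -100 entries.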