I want to implement a speech recognizer with CTC loss in TensorFlow. The input features have variable lengths because each speech utterance can have a different length. The labels also have variable length because each transcription is different. I manually pad the features to create the batches, and my model has a tf.keras.layers.Masking() layer to create and propagate the mask through the network. I also pad the labels when creating the label batch.
Here is a dummy example. Let's imagine that I have two utterances of length 3 and 5 frames respectively. Each frame is represented by one single feature (normally this would be 13 MFCCs but I reduce it to one to keep it simple). So to create the batch I pad the short utterance with 0 at the end:
import numpy as np
features = np.array([[1.5, 2.3, 4.6, 0.0, 0.0],
                     [1.7, 2.6, 3.4, 2.3, 1.0]])
The labels are the transcriptions of these utterances. Let's say that their lengths are 2 and 3 respectively. The label batch will have shape [2, 3, 26], where 2 is the batch size, 3 is the maximum label length, and 26 is the number of characters in the English alphabet (one-hot encoding).
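To make the label layout concrete, here is a minimal NumPy sketch of how such a padded one-hot batch could be built. The helper name one_hot_padded and the two transcriptions ("hi" and "cat") are made up for illustration, and the sketch assumes lowercase a-z text only; it is not my actual data pipeline.

```python
import numpy as np

NUM_CLASSES = 26  # letters a-z, as described above


def one_hot_padded(transcripts, num_classes=NUM_CLASSES):
    """Hypothetical helper: one-hot encode lowercase strings and zero-pad them."""
    max_len = max(len(t) for t in transcripts)
    batch = np.zeros((len(transcripts), max_len, num_classes), dtype=np.float32)
    for i, text in enumerate(transcripts):
        for j, ch in enumerate(text):
            # Map 'a'..'z' to class indices 0..25
            batch[i, j, ord(ch) - ord("a")] = 1.0
    return batch


# Two made-up transcriptions of length 2 and 3
labels = one_hot_padded(["hi", "cat"])
print(labels.shape)  # (2, 3, 26)
```

The padded position of the shorter transcription is an all-zero row, so it does not correspond to any character.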
The model is:
import tensorflow as tf
input_ = tf.keras.Input(shape=(None, 1))
x = tf.keras.layers.Masking()(input_)
# Feed the masked tensor (x, not input_) to the GRU so the mask is propagated
x = tf.keras.layers.GRU(26, return_sequences=True)(x)
output_ = tf.keras.layers.Softmax(axis=-1)(x)
model = tf.keras.Model(input_, output_)
The loss function is something like:
def ctc_loss(y_true, y_pred):
    # Do something here to get logit_length and label_length?
    # ...
    loss = tf.keras.backend.ctc_batch_cost(y_true, y_pred, logit_length, label_length)
    return loss
My question is how to get logit_length and label_length. I assume that logit_length is encoded in the mask, but when I read y_pred._keras_mask, the result is None. For label_length, the information is in the tensor itself, but I'm not sure what the most efficient way of getting it is.
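To show what I mean by the length being "encoded in the mask", here is a NumPy sketch (not TensorFlow code) that recovers the number of real frames per utterance from the padded batch. It assumes that padding frames are exactly 0.0 in every feature, which is the condition Masking() checks with its default mask_value, and that no genuine frame is all zeros. The names is_real_frame and frame_length are only for illustration.

```python
import numpy as np

# The padded dummy batch from above, with an explicit feature axis: shape (2, 5, 1)
features = np.array([[1.5, 2.3, 4.6, 0.0, 0.0],
                     [1.7, 2.6, 3.4, 2.3, 1.0]])[..., np.newaxis]

# A frame is treated as padding only if all of its features equal 0.0
is_real_frame = np.any(features != 0.0, axis=-1)          # shape (2, 5)

# Count the real frames per utterance, keeping a trailing axis: shape (2, 1)
frame_length = is_real_frame.sum(axis=1, keepdims=True)

print(frame_length.ravel())  # [3 5]
```

This is the kind of per-utterance value I would expect logit_length to hold, as long as the network does not change the number of time steps.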
Thanks.
UPDATE:
Following Tou You's answer, I use tf.math.count_nonzero to get label_length, and I set logit_length to the length (number of time steps) of the logit layer's output.
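A sketch of that computation, written in NumPy so that it runs on its own; inside the loss I use tf.math.count_nonzero, which accepts the same axis and keepdims arguments. The label values and num_frames below are made up, and the sketch assumes integer-encoded labels padded with 0, so index 0 must not be a real class.

```python
import numpy as np

# Hypothetical integer-encoded labels, zero-padded to the longest transcription
y_true = np.array([[3, 1, 20, 0, 0],
                   [8, 9, 0, 0, 0]])

# Hypothetical number of time steps in y_pred for this batch
num_frames = 7

# Number of non-padding labels per utterance, shape (batch, 1)
label_length = (y_true != 0).sum(axis=1, keepdims=True)

# Every utterance gets the full time dimension as its logit length, shape (batch, 1)
logit_length = np.full((y_true.shape[0], 1), num_frames)

print(label_length.ravel())  # [3 2]
print(logit_length.ravel())  # [7 7]
```

Note that setting logit_length to the full time dimension treats padded frames as real frames, which is a simplification.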
So the shapes inside the loss function are (batch size = 10):
y_true.shape = (10, None)
y_pred.shape = (10, None, 27)
label_length.shape = (10,1)
logit_length.shape = (10,1)
Of course, the two 'None' dimensions of y_true and y_pred are not the same, since one is the maximum label length in the batch and the other is the maximum number of time frames in the batch. However, when I call model.fit() with a loss that calls tf.keras.backend.ctc_batch_cost() with those parameters, I get the error:
Traceback (most recent call last):
File "train.py", line 164, in <module>
model.fit(dataset, batch_size=batch_size, epochs=10)
File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 66, in _method_wrapper
return method(self, *args, **kwargs)
File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py", line 848, in fit
tmp_logs = train_function(iterator)
File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 580, in __call__
result = self._call(*args, **kwds)
File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/def_function.py", line 644, in _call
return self._stateless_fn(*args, **kwds)
File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 2420, in __call__
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1661, in _filtered_call
return self._call_flat(
File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 1745, in _call_flat
return self._build_call_outputs(self._inference_function.call(
File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/function.py", line 593, in call
outputs = execute.execute(
File "/home/pablo/miniconda3/envs/lightvoice/lib/python3.8/site-packages/tensorflow/python/eager/execute.py", line 59, in quick_execute
tensors = pywrap_tfe.TFE_Py_Execute(ctx._handle, device_name, op_name,
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
(0) Invalid argument: Incompatible shapes: [10,92] vs. [10,876]
[[node Equal (defined at train.py:164) ]]
(1) Invalid argument: Incompatible shapes: [10,92] vs. [10,876]
[[node Equal (defined at train.py:164) ]]
[[ctc_loss/Log/_62]]
0 successful operations.
0 derived errors ignored. [Op:__inference_train_function_3156]
Function call stack:
train_function -> train_function
It looks like it is complaining that the length of y_true (92) is not the same as the length of y_pred (876), but I thought these two lengths were not required to match. What am I missing?