There is a tutorial for building a Transformer chatbot that takes several lists of word-encoded sentences of different lengths, first pads away the length differences with tf.keras.preprocessing, and then creates the dataset from the padded sentences.
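For reference, the tutorial's preprocessing step looks roughly like this (a sketch with toy stand-in data; I'm assuming it uses tf.keras.preprocessing.sequence.pad_sequences, which is the step I'd like to avoid):

```python
import tensorflow as tf

# Toy stand-ins for the real encoded questions/answers (hypothetical data,
# much shorter than the actual sentences).
questions = [[3, 224, 1, 224], [5475, 32, 16]]
answers = [[7, 8], [9, 10, 11]]

MAX_LENGTH = 40

# Pad every sentence to MAX_LENGTH *before* building the dataset.
questions = tf.keras.preprocessing.sequence.pad_sequences(
    questions, maxlen=MAX_LENGTH, padding='post')
answers = tf.keras.preprocessing.sequence.pad_sequences(
    answers, maxlen=MAX_LENGTH, padding='post')

# Only now is the (already rectangular) data turned into a dataset.
dataset = tf.data.Dataset.from_tensor_slices((questions, answers))
```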
I was trying to first create the dataset and then pad and batch it with dataset.padded_batch(), since using a single API seemed more cohesive.
The problem I found is that, without padding beforehand, I am left with a list of lists of different lengths (such as the list of encoded questions below), which, as far as I understand, cannot be used directly to create a dataset.
Sample question: i really, really, really wanna go, but i can t. not unless my sister goes.
Sample answer: i m workin on it. but she doesn t seem to be goin for him.
Encoded sample question: [3, 224, 1, 224, 1, 154, 295, 180, 1, 42, 3, 32, 5335, 4, 31, 589, 27, 416, 1387, 5265]
List of encoded questions: [[5475, 32, 16, 106, 38, 2392, 25, 3796, 4313, 11, 5143, 5073, 34, 565, 108, 1099, 4422, 1278, 1929, 76, 45, 5, 3911, 4, 272, 5265, 5476], [5475, 77, 1, 3, 168, 16, 69, 246, 37, 2412, 1, 49, 13, 8, 1315, 37, 35, 5265, 5476], ...]
The way to create a dataset from such objects without padding is apparently to use tf.RaggedTensor(s). I can now create a dataset with tf.data.Dataset.from_tensor_slices, but I haven't been able to pad the dataset afterwards.
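A minimal example of how far I get (TF 2.x; the data here is a toy stand-in for the real list of encoded questions):

```python
import tensorflow as tf

# Toy stand-in for the real list of encoded questions (different lengths).
encoded = [[3, 224, 1], [5475, 32], [77, 1, 3, 168]]

# Wrapping the lists in a RaggedTensor lets from_tensor_slices accept them.
ragged = tf.ragged.constant(encoded)
dataset = tf.data.Dataset.from_tensor_slices(ragged)

for element in dataset.take(1):
    # Each element is a 1-D tensor whose static shape is (None,).
    print(element)
```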
The error I get is the following; it relates to the shape I specified for the contents of my dataset (each element is one-dimensional, since the data is just a list of lists of different lengths):
dataset = dataset.padded_batch(
    BATCH_SIZE,
    padded_shapes=({'inputs': (None, MAX_LENGTH), 'dec_inputs': (None, MAX_LENGTH)},
                   {'outputs': (None, MAX_LENGTH)}))
Traceback (most recent call last):
File "C:\Program Files\JetBrains\PyCharm Community Edition 2019.2.4\helpers\pydev\pydevd.py", line 1415, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "C:\Program Files\JetBrains\PyCharm Community Edition 2019.2.4\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Users/---/Desktop/IA/PyProjects/NLP_models/Natural Language Inference/transformer_chatbot.py", line 208, in <module>
{'outputs': (None, MAX_LENGTH)}))
File "C:\Users\---\Desktop\IA\PyProjects\NLP_models\venv\lib\site-packages\tensorflow_core\python\data\ops\dataset_ops.py", line 1097, in padded_batch
drop_remainder)
File "C:\Users\---\Desktop\IA\PyProjects\NLP_models\venv\lib\site-packages\tensorflow_core\python\data\ops\dataset_ops.py", line 3341, in __init__
_padded_shape_to_tensor(padded_shape, input_component_shape))
File "C:\Users\----\Desktop\IA\PyProjects\NLP_models\venv\lib\site-packages\tensorflow_core\python\data\ops\dataset_ops.py", line 3269, in _padded_shape_to_tensor
% (padded_shape_as_shape, input_component_shape))
ValueError: The padded shape (None, 40) is not compatible with the corresponding input component shape (None,).
How can I give dataset.padded_batch() a reference shape so that it can pad the second dimension to some MAX_LENGTH? Or is there any other way to pad the data without the tf.keras.preprocessing step?