There is a tutorial for building a Transformer chatbot that takes several lists of word-encoded sentences of different lengths, first pads away the length differences with tf.keras.preprocessing, and then creates the dataset from the padded sentences.
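For reference, the tutorial's preprocessing step looks roughly like this (a sketch with toy stand-in data; I'm assuming it uses tf.keras.preprocessing.sequence.pad_sequences, which is the step I'd like to avoid):

```python
import tensorflow as tf

# Toy stand-ins for the real encoded questions/answers (hypothetical data,
# much shorter than the actual sentences).
questions = [[3, 224, 1, 224], [5475, 32, 16]]
answers = [[7, 8], [9, 10, 11]]

MAX_LENGTH = 40

# Pad every sentence to MAX_LENGTH *before* building the dataset.
questions = tf.keras.preprocessing.sequence.pad_sequences(
    questions, maxlen=MAX_LENGTH, padding='post')
answers = tf.keras.preprocessing.sequence.pad_sequences(
    answers, maxlen=MAX_LENGTH, padding='post')

# Only now is the (already rectangular) data turned into a dataset.
dataset = tf.data.Dataset.from_tensor_slices((questions, answers))
```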
I was trying to first create the dataset and then pad and batch it with dataset.padded_batch(), since using a single API seemed more cohesive.
The problem I found is that, without padding beforehand, I am left with a list of lists of different lengths (such as the list of encoded questions below), which, as far as I understand, cannot be used directly to create a dataset.
Sample question: i really, really, really wanna go, but i can t. not unless my sister goes.
Sample answer: i m workin on it. but she doesn t seem to be goin for him.
Encoded sample question: [3, 224, 1, 224, 1, 154, 295, 180, 1, 42, 3, 32, 5335, 4, 31, 589, 27, 416, 1387, 5265]
List of encoded questions: [[5475, 32, 16, 106, 38, 2392, 25, 3796, 4313, 11, 5143, 5073, 34, 565, 108, 1099, 4422, 1278, 1929, 76, 45, 5, 3911, 4, 272, 5265, 5476], [5475, 77, 1, 3, 168, 16, 69, 246, 37, 2412, 1, 49, 13, 8, 1315, 37, 35, 5265, 5476], ...]
The way to create a dataset from such objects without padding is apparently to use tf.RaggedTensor(s). I can now create a dataset with tf.data.Dataset.from_tensor_slices, but I haven't been able to pad the dataset afterwards.
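A minimal example of how far I get (TF 2.x; the data here is a toy stand-in for the real list of encoded questions):

```python
import tensorflow as tf

# Toy stand-in for the real list of encoded questions (different lengths).
encoded = [[3, 224, 1], [5475, 32], [77, 1, 3, 168]]

# Wrapping the lists in a RaggedTensor lets from_tensor_slices accept them.
ragged = tf.ragged.constant(encoded)
dataset = tf.data.Dataset.from_tensor_slices(ragged)

for element in dataset.take(1):
    # Each element is a 1-D tensor whose static shape is (None,).
    print(element)
```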
The error I get is the following; it relates to the shape I specified for the contents of my dataset (each element is one-dimensional, since the data is just a list of lists of different lengths):
dataset = dataset.padded_batch(
    BATCH_SIZE,
    padded_shapes=({'inputs': (None, MAX_LENGTH), 'dec_inputs': (None, MAX_LENGTH)},
                   {'outputs': (None, MAX_LENGTH)}))
Traceback (most recent call last):
File "C:\Program Files\JetBrains\PyCharm Community Edition 2019.2.4\helpers\pydev\pydevd.py", line 1415, in _exec
pydev_imports.execfile(file, globals, locals) # execute the script
File "C:\Program Files\JetBrains\PyCharm Community Edition 2019.2.4\helpers\pydev\_pydev_imps\_pydev_execfile.py", line 18, in execfile
exec(compile(contents+"\n", file, 'exec'), glob, loc)
File "C:/Users/---/Desktop/IA/PyProjects/NLP_models/Natural Language Inference/transformer_chatbot.py", line 208, in <module>
{'outputs': (None, MAX_LENGTH)}))
File "C:\Users\---\Desktop\IA\PyProjects\NLP_models\venv\lib\site-packages\tensorflow_core\python\data\ops\dataset_ops.py", line 1097, in padded_batch
drop_remainder)
File "C:\Users\---\Desktop\IA\PyProjects\NLP_models\venv\lib\site-packages\tensorflow_core\python\data\ops\dataset_ops.py", line 3341, in __init__
_padded_shape_to_tensor(padded_shape, input_component_shape))
File "C:\Users\----\Desktop\IA\PyProjects\NLP_models\venv\lib\site-packages\tensorflow_core\python\data\ops\dataset_ops.py", line 3269, in _padded_shape_to_tensor
% (padded_shape_as_shape, input_component_shape))
ValueError: The padded shape (None, 40) is not compatible with the corresponding input component shape (None,).
How can I give dataset.padded_batch() a reference shape so that it can pad the second dimension to some MAX_LENGTH? Or is there any other way to pad the data without the tf.keras.preprocessing step?