Invert Tokenizer Keras

Question

I'm trying to build a text autoencoder in Keras. I use the Tokenizer from the preprocessing module.

After training the encoder and the decoder, I was wondering how to convert the list of integers back into a list of words. I searched the official docs, but I found nothing that inverts the Tokenizer process.

Do you have any ideas?

Thanks!

There is a `word_index` attribute, which is a dictionary that maps words to indices. You could try inverting that. - nemo, Mar 22 '17 at 18:15

Gennaro · Answer 1 · 2021-10-22T15:33:06.310

from tensorflow.keras.preprocessing.text import Tokenizer

Assuming your corpus is stored in the corpus variable:

tok_obj = Tokenizer(num_words=10, oov_token='<OOV>')
tok_obj.fit_on_texts(corpus)

Assuming your sequences are stored in processed_seq. Here is an example:

processed_seq = tok_obj.texts_to_sequences(['sentences you want to predict on here'])

Build the inverse dictionary inv_map, which maps each index back to its word, and use it to look up every token. The loop below can be compressed into a list comprehension.

inv_map = {v: k for k, v in tok_obj.word_index.items()}

for seq in processed_seq:
    for tok in seq:
        print(inv_map[tok])
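The loop above prints one word per line and raises a KeyError for any id that is missing from word_index, such as the padding id 0, which Keras reserves and never assigns to a word. Here is a minimal sketch of a more forgiving decoder. To keep it runnable without TensorFlow, it uses a small hand-written word_index as a stand-in for tok_obj.word_index; the decode helper and the '<UNK>' placeholder are hypothetical names, not part of Keras.

```python
# Hypothetical stand-in for tok_obj.word_index (word -> integer index).
# With a fitted Tokenizer you would use tok_obj.word_index here instead.
word_index = {"<OOV>": 1, "the": 2, "cat": 3, "sat": 4}

# Invert the mapping: integer index -> word.
inv_map = {index: word for word, index in word_index.items()}


def decode(sequences, inv_map, unknown="<UNK>"):
    """Turn lists of integer ids back into space-joined strings.

    Ids that are absent from inv_map (for example the padding id 0)
    are replaced by the `unknown` placeholder instead of raising KeyError.
    """
    return [
        " ".join(inv_map.get(tok, unknown) for tok in seq)
        for seq in sequences
    ]


decoded = decode([[2, 3, 4], [2, 1, 0]], inv_map)
print(decoded)
```

Depending on your Keras version, the Tokenizer may already offer this: more recent releases expose an index_word attribute and a sequences_to_texts method, so check for those before writing your own inversion.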

If you need a complete example you may refer to this answer.
