
I'm trying to build a text autoencoder in Keras. I use the Tokenizer from the preprocessing module.

After training the encoder and the decoder, I was wondering how to convert the list of integers back into a list of words. I searched the official documentation but found nothing that inverts the Tokenizer process.

Do you have any ideas?

Thank you!

Pusheen_the_dev
    There is a `word_index` attribute, which is a dictionary that maps words to indices. You can try inverting that. – nemo Mar 22 '17 at 18:15

1 Answer

from tensorflow.keras.preprocessing.text import Tokenizer

Assuming your corpus is inside the corpus variable:

tok_obj = Tokenizer(num_words=10, oov_token='<OOV>')
tok_obj.fit_on_texts(corpus)

Assume your sequences are stored in processed_seq. Here is an example:

processed_seq = tok_obj.texts_to_sequences(['sentences you want to predict on here'])

Build the inverse dictionary inv_map and use it. The loop below can also be written as a list comprehension to shorten the code.

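# word_index maps word -> index; swap keys and values to get index -> word
inv_map = {v: k for k, v in tok_obj.word_index.items()}

# Print the word for every integer token in every sequence
for seq in processed_seq:
    for tok in seq:
        print(inv_map[tok])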
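
The same inversion can be sketched without Keras installed, which makes the list-comprehension form easy to try out. This is only a minimal sketch: the `word_index` dictionary here is a hand-written stand-in shaped like `tok_obj.word_index`, and the `decode` helper and the `'<UNK>'` placeholder are hypothetical names introduced for illustration. It uses `dict.get` so that an index missing from the mapping (for example 0, which Keras reserves for padding) does not raise a `KeyError`.

```python
# Hand-written stand-in for tok_obj.word_index (word -> integer index).
# Keras starts word indices at 1; 0 is reserved for padding.
word_index = {'<OOV>': 1, 'the': 2, 'cat': 3, 'sat': 4}

# Invert the mapping: integer index -> word.
inv_map = {index: word for word, index in word_index.items()}


def decode(sequences, inv_map, unknown='<UNK>'):
    """Turn lists of integer indices back into lists of words.

    Indices that are missing from inv_map are replaced by `unknown`
    (a hypothetical placeholder, not something Keras defines).
    """
    return [[inv_map.get(tok, unknown) for tok in seq] for seq in sequences]


decoded = decode([[2, 3, 4], [2, 1, 0]], inv_map)
print(decoded)
```

Depending on your Keras version, the Tokenizer may already expose this for you through its `index_word` attribute and `sequences_to_texts` method, so it is worth checking for those before building the dictionary by hand.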

If you need a complete example you may refer to this answer.

Gennaro