
The TensorFlow/Keras Tokenizer tokenizes and encodes text into machine-readable vectors. First we call fit_on_texts on a large body of text to build a dictionary, then we call fit_on_sequences on our input text to build the corresponding vector encoding.
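Roughly, the dictionary-building step looks like this (a minimal sketch; the corpus strings are just placeholders):

from tensorflow.keras.preprocessing.text import Tokenizer

# Sketch: fit_on_texts builds the word -> index dictionary from a corpus
tokenizer = Tokenizer()
tokenizer.fit_on_texts(["some large amount of text", "more text here"])
print(tokenizer.word_index)  # e.g. {'text': 1, 'some': 2, ...}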

(See also: What does Keras Tokenizer method exactly do?)

However, there does not seem to be a built-in method for the reverse operation, i.e., retrieving text from numerical vectors based on the dictionary.

In Python, something like this could be implemented:

# map a predicted word index back to a word
out_word = ''
for word, index in tokenizer.word_index.items():
    if index == yhat:
        out_word = word
        break
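For completeness, the same lookup could be wrapped in a small helper (a sketch, assuming yhat is a single integer word index):

def index_to_word(yhat, tokenizer):
    # return the word whose index matches the predicted value, or '' if none
    for word, index in tokenizer.word_index.items():
        if index == yhat:
            return word
    return ''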

Is there a nice way to retrieve text from digits? In other words, is there a built-in reverse operation of fit_to_sequences?

kiriloff
  • You have two main methods on a Tokenizer: texts_to_sequences to get the sequence and sequences_to_texts to do the inverse. I don't know what you mean by reverse fit_to_sequences. – Pedro Fillastre Sep 29 '21 at 09:04

1 Answer


There is a built-in method available for retrieving text from numerical vectors.

For instance, check the code below:

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ["Life is Beautiful"]

# Fit the tokenizer to build the word -> index dictionary
tokenizer = Tokenizer(num_words=30)
tokenizer.fit_on_texts(sentences)

word_index = tokenizer.word_index
print("Word Index: ", word_index)

# Manually invert the dictionary (index -> word)
reverse_word_index = {index: word for word, index in word_index.items()}
print("Reversed Word Index:", reverse_word_index)

# Encode text to integer sequences
seq = tokenizer.texts_to_sequences(sentences)
print("Texts to Numbers:", seq)

# Decode integer sequences back to text (the built-in reverse operation)
seq_to_wrd = tokenizer.sequences_to_texts(seq)
print("Numbers to Texts:", seq_to_wrd)

Output:

Word Index:  {'life': 1, 'is': 2, 'beautiful': 3}
Reversed Word Index: {1: 'life', 2: 'is', 3: 'beautiful'}
Texts to Numbers: [[1, 2, 3]]
Numbers to Texts: ['life is beautiful']
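The same built-in also works for a single predicted index. A minimal sketch, assuming yhat is an integer word index (for example the argmax of a model's output):

yhat = 3  # hypothetical predicted word index
# sequences_to_texts expects a list of sequences, so wrap the index twice
print(tokenizer.sequences_to_texts([[yhat]]))  # ['beautiful']

# Recent Keras versions also expose the inverse dictionary directly
print(tokenizer.index_word[yhat])              # 'beautiful'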

Check the tf.keras.preprocessing.text.Tokenizer documentation to find more Tokenizer built-in functions.