I understand that you want to extract the embedding for each word, but I think the real question is: what output is the tokenizer actually producing?
Also, that tokenizer is a bit of a mess; you'll see what I mean below.
Because the tokenizer will filter words (assuming a non-trivial vocabulary), I don't want to assume that the words are stored in the order in which they are found. So here I programmatically determine the vocabulary using word_index, and then explicitly check which words survive the filtering for the most frequently used words. (word_index remembers all words, i.e. the pre-filtered values.)
import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

corpus = 'I like turtles'
num_words = len(corpus.split())
oov = 'OOV'

# num_words + 2: only indices strictly below num_words are kept, indices start at 1,
# and the OOV token takes index 1, pushing the real words to 2..4 (see below)
tokenizer = Tokenizer(num_words=num_words + 2, oov_token=oov)
tokenizer.fit_on_texts(corpus.split())

print(f'word_index: {tokenizer.word_index}')
print(f'vocabulary: {tokenizer.word_index.keys()}')

# every word the tokenizer knows about
text = [key for key in tokenizer.word_index.keys()]
print(f'keys: {text}: {tokenizer.texts_to_sequences(text)}')

# an in-vocabulary sentence
text = 'I like turtles'.split()
print(f'{text}: {tokenizer.texts_to_sequences(text)}')

# a sentence containing an out-of-vocabulary word
text = 'I like marshmallows'.split()
print(f'{text}: {tokenizer.texts_to_sequences(text)}')
This produces the following output:
word_index: {'OOV': 1, 'i': 2, 'like': 3, 'turtles': 4}
vocabulary: dict_keys(['OOV', 'i', 'like', 'turtles'])
keys: ['OOV', 'i', 'like', 'turtles']: [[1], [2], [3], [4]]
['I', 'like', 'turtles']: [[2], [3], [4]]
['I', 'like', 'marshmallows']: [[2], [3], [1]]
Note that because we passed oov_token, the OOV token itself gets an entry in word_index (at index 1):
{'OOV': 1, 'i': 2, 'like': 3, 'turtles': 4}
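For contrast, here's a small sketch (not part of the snippet above, but reusing corpus and num_words) of what happens when oov_token is left at its default of None: there is no OOV entry in word_index, and unknown words are silently dropped from the sequences.
plain_tokenizer = Tokenizer(num_words=num_words + 1)                        # no oov_token, so '+ 1' suffices
plain_tokenizer.fit_on_texts(corpus.split())
print(plain_tokenizer.word_index)                                           # {'i': 1, 'like': 2, 'turtles': 3}
print(plain_tokenizer.texts_to_sequences('I like marshmallows'.split()))    # [[1], [2], []]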
Notice how I had to specify num_words=num_words + 2 instead of the expected '+ 1'.
That's because we're explicitly defining an OOV token, which gets added to the vocabulary at index 1 and pushes the real words down by one slot, while the tokenizer only keeps words whose index is strictly less than num_words, which is a bit nuts imo.
If you specify an OOV token and set num_words=num_words + 1 (as documented), then 'turtles' (index 4) gets filtered out, and 'I like turtles' gets the same encoding as 'I like marshmallows'. Also nuts.
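Here's a quick way to see that for yourself (docs_tokenizer is just a throwaway name for this sketch, reusing corpus, num_words, and oov from above):
docs_tokenizer = Tokenizer(num_words=num_words + 1, oov_token=oov)       # '+ 1', as documented
docs_tokenizer.fit_on_texts(corpus.split())
print(docs_tokenizer.texts_to_sequences('I like turtles'.split()))       # [[2], [3], [1]]
print(docs_tokenizer.texts_to_sequences('I like marshmallows'.split()))  # [[2], [3], [1]]
Both sentences come out as [[2], [3], [1]], because 'turtles' (index 4) is no longer below num_words and gets mapped to the OOV index.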
Hopefully, you now have the tools to know what the tokenizer is feeding the embedding layer. Then, hopefully, it'll be trivial to correlate the tokens with their embeddings.
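In case it helps with that last step, here's a rough sketch of how to pull the vector for each token out of an Embedding layer. The layer and embedding_dim below are made up for illustration (and untrained, so the vectors are random), but the indexing works the same way for your trained layer:
from tensorflow.keras.layers import Embedding

embedding_dim = 8                                    # arbitrary size for this sketch
# input_dim must cover every index the tokenizer can emit (index 0 is reserved for padding)
embedding = Embedding(input_dim=num_words + 2, output_dim=embedding_dim)
embedding(tf.constant([[0]]))                        # call the layer once so its weights get created
weights = embedding.get_weights()[0]                 # shape: (num_words + 2, embedding_dim)

for word, index in tokenizer.word_index.items():
    print(word, weights[index])                      # row `index` is that word's embedding vector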
Please let us know what you find. :)
(For more on the madness, check out this StackOverflow post.)