
From a number of examples I have seen, when we use text_tokenizer from keras and specify the input size for the input layer, we use vocab size + 1. This naturally yields an embedding space with one extra 'row'.

For example, I fit a simple model to estimate the embedding vectors for a vocab of size 3 ("I like turtles"). Each word in our vocabulary gets an embedding vector of length 5.
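
Roughly this kind of setup, as a minimal Python/Keras sketch of what I mean (the names are just illustrative, not my exact code):

import numpy as np
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.layers import Embedding

tokenizer = Tokenizer()
tokenizer.fit_on_texts(['I like turtles'])
vocab_size = len(tokenizer.word_index)            # 3

# input size = vocab size + 1, embedding length 5 per word
embedding = Embedding(input_dim=vocab_size + 1, output_dim=5)
embedding(np.array([[1, 2, 3]]))                  # call the layer once so its weights are created
weights = embedding.get_weights()[0]              # shape (4, 5): one extra 'row'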

The embedding weights are:

 0.01209533   0.034303080  -0.04666784   0.02803965  -0.03691160
-0.01302978  -0.030584216  -0.02506201   0.04771456   0.01906699
 0.02800793   0.042204402   0.05223191  -0.01184921   0.02000498
 0.02692273  -0.008792922   0.01560913  -0.02783649   0.02692282

My question: I assume that the first "row" in our matrix is the vector for index 0, such that rows 2, 3, and 4 would be associated with "I", "like", and "turtles" respectively.

Is this the case? I want to ensure that I align my vocabulary properly, but I haven't been able to pin down any documentation to confirm this assumption.

Btibert3

2 Answers


I understand that you want to extract the embedding for each word, but I think the real question is: what output is the tokenizer producing?

Also, that tokenizer is a bit of a mess. You'll see what I mean below.

Because the tokenizer will filter words (assuming a non-trivial vocabulary), I don't want to assume that the words are stored in the order in which they are found. So here I programmatically determine the vocabulary using word_index. I then explicitly check which words are tokenized after filtering for the most frequently used words. (word_index remembers all words, i.e. the pre-filtered values.)

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

corpus = 'I like turtles'
num_words = len(corpus.split())
oov = 'OOV'

# num_words + 2 rather than the documented + 1 -- discussed below
tokenizer = Tokenizer(num_words=num_words + 2, oov_token=oov)
tokenizer.fit_on_texts(corpus.split())
print(f'word_index: {tokenizer.word_index}')
print(f'vocabulary: {tokenizer.word_index.keys()}')

# tokenize the full (pre-filter) vocabulary itself
text = [key for key in tokenizer.word_index.keys()]
print(f'keys: {text}: {tokenizer.texts_to_sequences(text)}')

# an in-vocabulary sentence
text = 'I like turtles'.split()
print(f'{text}: {tokenizer.texts_to_sequences(text)}')

# a sentence containing an out-of-vocabulary word
text = 'I like marshmallows'.split()
print(f'{text}: {tokenizer.texts_to_sequences(text)}')

This produces the following output:

word_index: {'OOV': 1, 'i': 2, 'like': 3, 'turtles': 4}
vocabulary: dict_keys(['OOV', 'i', 'like', 'turtles'])
keys: ['OOV', 'i', 'like', 'turtles']: [[1], [2], [3], [4]]
['I', 'like', 'turtles']: [[2], [3], [4]]
['I', 'like', 'marshmallows']: [[2], [3], [1]]

Because we specified oov_token, the OOV token itself gets inserted into word_index, at index 1:

{'OOV': 1, 'i': 2, 'like': 3, 'turtles': 4}

Notice how I had to specify num_words=num_words + 2 instead of the expected '+1'. That's because we're explicitly defining an OOV token, which gets added to the vocabulary, which is a bit nuts imo.

If you specify an OOV token and you set num_words=num_words + 1 (as documented), then 'I like turtles' gets the same encoding as 'I like marshmallows'. Also nuts.
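
To see that concretely, here is a small sketch reusing the corpus, num_words, and oov variables from the snippet above:

# Same corpus, but with the documented num_words + 1.
# 'turtles' (index 4) now falls outside num_words and gets mapped to OOV (index 1),
# so the two sentences below receive identical encodings.
tokenizer_plus_one = Tokenizer(num_words=num_words + 1, oov_token=oov)
tokenizer_plus_one.fit_on_texts(corpus.split())
print(tokenizer_plus_one.texts_to_sequences('I like turtles'.split()))      # [[2], [3], [1]]
print(tokenizer_plus_one.texts_to_sequences('I like marshmallows'.split())) # [[2], [3], [1]]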

Hopefully, you now have the tools to know what the tokenizer is feeding the embedding layer. Then, hopefully, it'll be trivial to correlate the tokens with their embeddings.
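
For example, something along these lines (just a sketch; it assumes your model has an Embedding layer named 'embedding', which you would adjust to your own layer name):

# In Keras, row i of the embedding weight matrix is the vector for token index i;
# index 0 is never assigned by the Tokenizer and is conventionally reserved for padding.
weights = model.get_layer('embedding').get_weights()[0]
word_vectors = {word: weights[i]
                for word, i in tokenizer.word_index.items()
                if i < len(weights)}
print(word_vectors)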

Please let us know what you find. :)

(For more on the madness, check out this StackOverflow post.)

Eric McLachlan
  • This is fantastic, thank you. I have been using `index_word` and `word_index` along with texts_to_sequences to parse my data. So based on the above, it sounds like you would recommend adding the OOV (which I wasn't), but my question still remains. Above, when adding the OOV and, to your point, +2, after retrieving the weights from the embedding layer there are 5 rows, when there are only 3 words (4 if you include the OOV). My end goal is to extract these embeddings and align them back to the original words, so I am not sure where to go given the output. – Btibert3 Feb 19 '20 at 16:35
  • @Btibert: As for exploring the embedding from the embedding layer itself, have you seen an answer here: https://stackoverflow.com/q/51235118/4093278 – Eric McLachlan Feb 19 '20 at 19:37
  • Thanks, I have seen a few of these examples, but in my simple case, if I have 3 words (or 4 with the OOV included), my embedding layer has 5 rows, as discussed above. That is what was returned. So that suggests that the last row can be discarded (based on the link above)? Thx again. – Btibert3 Feb 19 '20 at 20:09

I worked on a similar problem to @Btibert3's: I also needed to extract word embeddings. This is what I did to extract the word embeddings, which do not include the 'OOV' token.

from tensorflow.keras.preprocessing.text import Tokenizer

# num_tokens: the number of distinct words in the training corpus (defined elsewhere);
# the tokenizer is then fit on that corpus with fit_on_texts() (not shown here)
my_tokenizer = Tokenizer(num_words=(num_tokens + 1), oov_token="<OOV>")

# list all tokens & corresponding idx
print('\nlist all tokens & corresponding idx\n')
print(my_tokenizer.word_index.items())

The output is

list all tokens & corresponding idx

dict_items([('<OOV>', 1), ('i', 2), ('like', 3), ('melon', 4), ('and', 5)])

Put all outputs into a DataFrame

import pandas as pd

# list of tokens & token_id in the training set
tokens = list(my_tokenizer.word_index.keys())[1:]     # skip the 1st item, which is '<OOV>'
idx = list(my_tokenizer.word_index.values())[1:]      # skip the 1st value, the '<OOV>' idx = 1
TokenDF = pd.DataFrame({'token': tokens, 'idx': idx})

Finally, after model training is done, add each word embedding to the corresponding token, e.g.:

# obtain word_embeddings
token_embeddings = model.get_layer('embedding').get_weights()[0].tolist()
TokenDF['embedding'] = token_embeddings
print(f'TokenDF shape: {TokenDF.shape}\n')
print(TokenDF.head())

The outputs are

TokenDF shape: (4, 3)

      token  idx                                          embedding
0        i    2  [-0.019641876220703125, -0.004887472838163376,...
1     like    3  [0.025391878560185432, 0.01904352754354477, -0...
2    melon    4  [-0.09583209455013275, 0.02108164131641388, -0...
3      and    5  [-0.06740732491016388, 0.006626977119594812, -...
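
If you later need the vector for a single word, you can look it up from the DataFrame, for example:

# look up the embedding vector for one token from TokenDF
melon_vec = TokenDF.loc[TokenDF['token'] == 'melon', 'embedding'].iloc[0]
print(len(melon_vec))    # the embedding dimension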

As you can see, TensorFlow/Keras did not generate a word embedding for the OOV token in this setup. You can verify this with:

len(model.get_layer('embedding').get_weights()[0]) # num_tokens
4       
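
As a quick sanity check before aligning words to rows (a sketch reusing the names above), you can also compare the number of embedding rows with the number of word_index entries, so you can see exactly how your tokenizer and Embedding layer line up:

n_rows = len(model.get_layer('embedding').get_weights()[0])
n_index_entries = len(my_tokenizer.word_index)    # includes the '<OOV>' entry
print(f'embedding rows: {n_rows}, word_index entries: {n_index_entries}')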

Note: I was using TensorFlow 2.10.0 for this. You might also want to refer to the TensorFlow website for more detailed information.

DaCard